mGrowTech

A Coding Implementation to Run Qwen3.5 Reasoning Models Distilled with Claude-Style Thinking Using GGUF and 4-Bit Quantization

By Josh · March 26, 2026 · AI, Analytics and Automation

In this tutorial, we work directly with Qwen3.5 models distilled with Claude-style reasoning and set up a Colab pipeline that lets us switch between a 27B GGUF variant and a lightweight 2B 4-bit version with a single flag. We start by validating GPU availability, then conditionally install either llama.cpp or transformers with bitsandbytes, depending on the selected path. Both branches are unified through shared generate_fn and stream_fn interfaces, ensuring consistent inference across backends. We also implement a ChatSession class for multi-turn interaction and build utilities to parse <think> traces, allowing us to explicitly separate reasoning from final outputs during execution.

MODEL_PATH = "2B_HF"  # "27B_GGUF" for the llama.cpp path, "2B_HF" for 4-bit transformers


import torch


if not torch.cuda.is_available():
   raise RuntimeError(
       "❌ No GPU! Go to Runtime → Change runtime type → T4 GPU."
   )


gpu_name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"✅ GPU: {gpu_name} — {vram_gb:.1f} GB VRAM")


import subprocess, sys, os, re, time


generate_fn = None
stream_fn = None

We initialize the execution by setting the model path flag and checking whether a GPU is available on the system. We retrieve and print the GPU name along with available VRAM to ensure the environment meets the requirements. We also import all required base libraries and define placeholders for the unified generation functions that will be assigned later.
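As an optional convenience not in the script above, the flag could be derived from the detected VRAM instead of being set by hand. The 20 GB cutoff below is an assumption based on the ~16.5 GB size of the Q4_K_M GGUF file, not a measured requirement:

```python
def choose_model_path(vram_gb: float) -> str:
    """Pick a backend flag from available VRAM.

    Assumption: the 27B Q4_K_M GGUF file alone is ~16.5 GB, so we
    require at least 20 GB of VRAM before choosing it; smaller cards
    (e.g., a 16 GB T4) fall back to the 4-bit 2B model.
    """
    return "27B_GGUF" if vram_gb >= 20 else "2B_HF"

# Example: a 16 GB T4 selects the lightweight path.
print(choose_model_path(15.8))  # → 2B_HF
```

In the notebook this would replace the hard-coded assignment with `MODEL_PATH = choose_model_path(vram_gb)` after the VRAM check.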


if MODEL_PATH == "27B_GGUF":
   print("\n📦 Installing llama-cpp-python with CUDA (takes 3-5 min)...")
   env = os.environ.copy()
   env["CMAKE_ARGS"] = "-DGGML_CUDA=on"
   subprocess.check_call(
       [sys.executable, "-m", "pip", "install", "-q", "llama-cpp-python", "huggingface_hub"],
       env=env,
   )
   print("✅ Installed.\n")


   from huggingface_hub import hf_hub_download
   from llama_cpp import Llama


   GGUF_REPO = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"
   GGUF_FILE = "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-Q4_K_M.gguf"


   print(f"⏳ Downloading {GGUF_FILE} (~16.5 GB)... grab a coffee ☕")
   model_path = hf_hub_download(repo_id=GGUF_REPO, filename=GGUF_FILE)
   print(f"✅ Downloaded: {model_path}\n")


   print("⏳ Loading into llama.cpp (GPU offload)...")
   llm = Llama(
       model_path=model_path,
       n_ctx=8192,
       n_gpu_layers=40,
       n_threads=4,
       verbose=False,
   )
   print("✅ 27B GGUF model loaded!\n")


   def generate_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95, **kwargs
   ):
       output = llm.create_chat_completion(
           messages=[
               {"role": "system", "content": system_prompt},
               {"role": "user", "content": prompt},
           ],
           max_tokens=max_new_tokens,
           temperature=temperature,
           top_p=top_p,
       )
       return output["choices"][0]["message"]["content"]


   def stream_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
   ):
       print("⏳ Streaming output:\n")
       for chunk in llm.create_chat_completion(
           messages=[
               {"role": "system", "content": system_prompt},
               {"role": "user", "content": prompt},
           ],
           max_tokens=max_new_tokens,
           temperature=temperature,
           top_p=top_p,
           stream=True,
       ):
           delta = chunk["choices"][0].get("delta", {})
           text = delta.get("content", "")
           if text:
               print(text, end="", flush=True)
       print()


   class ChatSession:
       def __init__(self, system_prompt="You are a helpful assistant. Think step by step."):
           self.messages = [{"role": "system", "content": system_prompt}]
       def chat(self, user_message, temperature=0.6):
           self.messages.append({"role": "user", "content": user_message})
           output = llm.create_chat_completion(
               messages=self.messages, max_tokens=2048,
               temperature=temperature, top_p=0.95,
           )
           resp = output["choices"][0]["message"]["content"]
           self.messages.append({"role": "assistant", "content": resp})
           return resp

We handle the 27B GGUF path by installing llama.cpp with CUDA support and downloading the Qwen3.5 27B distilled model from Hugging Face. We load the model with GPU offloading and define a standardized generate_fn and stream_fn for inference and streaming outputs. We also implement a ChatSession class to maintain conversation history for multi-turn interactions.

elif MODEL_PATH == "2B_HF":
   print("\n📦 Installing transformers + bitsandbytes...")
   subprocess.check_call([
       sys.executable, "-m", "pip", "install", "-q",
       "transformers @ git+https://github.com/huggingface/transformers.git@main",
       "accelerate", "bitsandbytes", "sentencepiece", "protobuf",
   ])
   print("✅ Installed.\n")


   from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer


   HF_MODEL_ID = "Jackrong/Qwen3.5-2B-Claude-4.6-Opus-Reasoning-Distilled"


   bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_compute_dtype=torch.bfloat16,
       bnb_4bit_use_double_quant=True,
   )


   print(f"⏳ Loading {HF_MODEL_ID} in 4-bit...")
   tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_ID, trust_remote_code=True)
   model = AutoModelForCausalLM.from_pretrained(
       HF_MODEL_ID,
       quantization_config=bnb_config,
       device_map="auto",
       trust_remote_code=True,
       torch_dtype=torch.bfloat16,
   )
   print(f"✅ Model loaded! Memory: {model.get_memory_footprint() / 1e9:.2f} GB\n")


   def generate_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
       repetition_penalty=1.05, do_sample=True, **kwargs
   ):
       messages = [
           {"role": "system", "content": system_prompt},
           {"role": "user", "content": prompt},
       ]
       text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
       inputs = tokenizer(text, return_tensors="pt").to(model.device)
       with torch.no_grad():
           output_ids = model.generate(
               **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
               top_p=top_p, repetition_penalty=repetition_penalty, do_sample=do_sample,
           )
       generated = output_ids[0][inputs["input_ids"].shape[1]:]
       return tokenizer.decode(generated, skip_special_tokens=True)


   def stream_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
   ):
       messages = [
           {"role": "system", "content": system_prompt},
           {"role": "user", "content": prompt},
       ]
       text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
       inputs = tokenizer(text, return_tensors="pt").to(model.device)
       streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
       print("⏳ Streaming output:\n")
       with torch.no_grad():
           model.generate(
               **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
               top_p=top_p, do_sample=True, streamer=streamer,
           )


   class ChatSession:
       def __init__(self, system_prompt="You are a helpful assistant. Think step by step."):
           self.messages = [{"role": "system", "content": system_prompt}]
       def chat(self, user_message, temperature=0.6):
           self.messages.append({"role": "user", "content": user_message})
           text = tokenizer.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
           inputs = tokenizer(text, return_tensors="pt").to(model.device)
           with torch.no_grad():
               output_ids = model.generate(
                   **inputs, max_new_tokens=2048, temperature=temperature, top_p=0.95, do_sample=True,
               )
           generated = output_ids[0][inputs["input_ids"].shape[1]:]
           resp = tokenizer.decode(generated, skip_special_tokens=True)
           self.messages.append({"role": "assistant", "content": resp})
           return resp
else:
   raise ValueError("MODEL_PATH must be '27B_GGUF' or '2B_HF'")

We implement the lightweight 2B path using transformers with 4-bit quantization through bitsandbytes. We load the Qwen3.5 2B distilled model efficiently onto the GPU and configure generation parameters for controlled sampling. We again define unified generation, streaming, and chat session logic so that both model paths behave identically during execution.
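To see why the 2B model fits comfortably on a T4, a back-of-the-envelope NF4 memory estimate helps. This is illustrative only: real footprints vary because embedding and norm layers stay in higher precision and quantization constants add overhead, which the `overhead` factor here loosely approximates:

```python
def nf4_footprint_gb(n_params: float, overhead: float = 0.1) -> float:
    """Approximate NF4 (4-bit) weight memory in GB.

    4 bits = 0.5 bytes per parameter; `overhead` is a rough allowance
    for quantization constants and layers kept in higher precision.
    """
    return n_params * 0.5 * (1 + overhead) / 1e9

# A 2B-parameter model in 4-bit: roughly 1.1 GB of weights.
print(f"{nf4_footprint_gb(2e9):.2f} GB")  # → 1.10 GB
```

This is the same order of magnitude as the `model.get_memory_footprint()` value printed after loading, which is why the 2B path leaves ample headroom for activations and KV cache.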

def parse_thinking(response: str) -> tuple[str, str]:
   m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
   if m:
       return m.group(1).strip(), response[m.end():].strip()
   return "", response.strip()




def display_response(response: str):
   thinking, answer = parse_thinking(response)
   if thinking:
       print("🧠 THINKING:")
       print("-" * 60)
       print(thinking[:1500] + ("\n... [truncated]" if len(thinking) > 1500 else ""))
       print("-" * 60)
   print("\n💬 ANSWER:")
   print(answer)




print("✅ All helpers ready. Running tests...\n")

We define helper functions to extract reasoning traces enclosed within <think> tags and separate them from final answers. We create a display utility that formats and prints both the thinking process and the response in a structured way. This allows us to inspect how the Qwen-based model reasons internally during generation.
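A quick standalone check makes the parsing behavior concrete; the regex is reproduced from the helper above so the snippet runs on its own, and the sample string is invented for illustration:

```python
import re

def parse_thinking(response: str) -> tuple:
    """Split a response into (thinking, answer) around <think>...</think>."""
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if m:
        return m.group(1).strip(), response[m.end():].strip()
    return "", response.strip()

sample = "<think>Half of 3 is 1.5, plus 5 is 6.5.</think>You have 6.5 apples."
thinking, answer = parse_thinking(sample)
print(thinking)  # → Half of 3 is 1.5, plus 5 is 6.5.
print(answer)    # → You have 6.5 apples.
```

When the model emits no `<think>` block (as in the "Thinking OFF" test below), the helper returns an empty thinking string and the full response as the answer.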

print("=" * 70)
print("📝 TEST 1: Basic reasoning")
print("=" * 70)


response = generate_fn(
   "If I have 3 apples and give away half, then buy 5 more, how many do I have? "
   "Explain your reasoning."
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 2: Streaming output")
print("=" * 70)


stream_fn(
   "Explain the difference between concurrency and parallelism. "
   "Give a real-world analogy for each."
)


print("\n" + "=" * 70)
print("📝 TEST 3: Thinking ON vs OFF")
print("=" * 70)


question = "What is the capital of France?"


print("\n--- Thinking ON (default) ---")
resp = generate_fn(question)
display_response(resp)


print("\n--- Thinking OFF (concise) ---")
resp = generate_fn(
   question,
   system_prompt="Answer directly and concisely. Do not use <think> tags.",
   max_new_tokens=256,
)
display_response(resp)


print("\n" + "=" * 70)
print("📝 TEST 4: Bat & ball trick question")
print("=" * 70)


response = generate_fn(
   "A bat and a ball cost $1.10 in total. "
   "How much does the ball cost? Show complete reasoning and verify.",
   system_prompt="You are a precise mathematical reasoner. Set up equations and verify.",
   temperature=0.3,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 5: Train meeting problem")
print("=" * 70)


response = generate_fn(
   "A train leaves Station A at 9:00 AM at 60 mph toward Station B. "
   "Another leaves Station B at 10:00 AM at 80 mph toward Station A. "
   "Stations are 280 miles apart. When and where do they meet?",
   temperature=0.3,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 6: Logic puzzle (five houses)")
print("=" * 70)


response = generate_fn(
   "Five houses in a row are painted different colors. "
   "The red house is left of the blue house. "
   "The green house is in the middle. "
   "The yellow house is not next to the blue house. "
   "The white house is at one end. "
   "What is the order from left to right?",
   temperature=0.3,
   max_new_tokens=3000,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 7: Code generation — longest palindromic substring")
print("=" * 70)


response = generate_fn(
   "Write a Python function to find the longest palindromic substring "
   "using Manacher's algorithm. Include docstring, type hints, and tests.",
   system_prompt="You are an expert Python programmer. Think through the algorithm carefully.",
   max_new_tokens=3000,
   temperature=0.3,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 8: Multi-turn conversation (physics tutor)")
print("=" * 70)


session = ChatSession(
   system_prompt="You are a knowledgeable physics tutor. Explain clearly with examples."
)


turns = [
   "What is the Heisenberg uncertainty principle?",
   "Can you give me a concrete example with actual numbers?",
   "How does this relate to quantum tunneling?",
]


for i, q in enumerate(turns, 1):
   print(f"\n{'─'*60}")
   print(f"👤 Turn {i}: {q}")
   print(f"{'─'*60}")
   resp = session.chat(q, temperature=0.5)
   _, answer = parse_thinking(resp)
   print(f"🤖 {answer[:1000]}{'...' if len(answer) > 1000 else ''}")


print("\n" + "=" * 70)
print("📝 TEST 9: Temperature comparison — creative writing")
print("=" * 70)


creative_prompt = "Write a one-paragraph opening for a sci-fi story about AI consciousness."


configs = [
   {"label": "Low temp (0.1)",  "temperature": 0.1, "top_p": 0.9},
   {"label": "Med temp (0.6)",  "temperature": 0.6, "top_p": 0.95},
   {"label": "High temp (1.0)", "temperature": 1.0, "top_p": 0.98},
]


for cfg in configs:
   print(f"\n🎛️  {cfg['label']}")
   print("-" * 60)
   start = time.time()
   resp = generate_fn(
       creative_prompt,
       system_prompt="You are a creative fiction writer.",
       max_new_tokens=512,
       temperature=cfg["temperature"],
       top_p=cfg["top_p"],
   )
   elapsed = time.time() - start
   _, answer = parse_thinking(resp)
   print(answer[:600])
   print(f"⏱️  {elapsed:.1f}s")


print("\n" + "=" * 70)
print("📝 TEST 10: Speed benchmark")
print("=" * 70)


start = time.time()
resp = generate_fn(
   "Explain how a neural network learns, step by step, for a beginner.",
   system_prompt="You are a patient, clear teacher.",
   max_new_tokens=1024,
)
elapsed = time.time() - start


approx_tokens = int(len(resp.split()) * 1.3)  # rough words→tokens heuristic
print(f"~{approx_tokens} tokens in {elapsed:.1f}s")
print(f"~{approx_tokens / elapsed:.1f} tokens/sec")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")


import gc


for name in ["model", "llm"]:
   if name in globals():
       del globals()[name]
gc.collect()
torch.cuda.empty_cache()


print(f"\n✅ Memory freed. VRAM: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print("\n" + "=" * 70)
print("🎉 Tutorial complete!")
print("=" * 70)

We run a comprehensive test suite that evaluates the model across reasoning, streaming, logic puzzles, code generation, and multi-turn conversations. We compare outputs under different temperature settings and measure performance in terms of speed and token throughput. Finally, we clean up memory and free GPU resources, ensuring the notebook remains reusable for further experiments.
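The throughput numbers in Test 10 rest on a crude words-to-tokens conversion. Isolated as a function, the heuristic looks like this; the 1.3 multiplier is the script's own rough constant for English text, not an exact tokenizer count:

```python
def estimate_tokens(text: str, ratio: float = 1.3) -> int:
    """Rough token estimate: English prose averages ~1.3 tokens per word."""
    return int(len(text.split()) * ratio)

resp = "A neural network learns by adjusting weights via gradient descent."
print(estimate_tokens(resp))  # 10 words → 13 estimated tokens
```

For a precise count one would instead call `tokenizer(resp)` on the 2B path or `llm.tokenize()` on the GGUF path, but the heuristic suffices for a ballpark tokens/sec figure.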

In conclusion, we have a compact but flexible setup for running Qwen3.5-based reasoning models enhanced with Claude-style distillation across different hardware constraints. The script abstracts backend differences while exposing consistent generation, streaming, and conversational interfaces, making it easy to experiment with reasoning behavior. Through the test suite, we probe how the model handles structured reasoning, edge-case questions, and longer multi-step tasks, while also measuring speed and memory usage. What we end up with is not just a demo, but a reusable scaffold for evaluating and extending Qwen-based reasoning systems in Colab without changing the core code.

