
Mastering LLMs: How to Train Qwen-3 LLM From Scratch for Unrivaled Performance


Have you ever wondered what truly goes on under the hood of a large language model (LLM)? Beyond simply using pre-trained models, imagine the profound understanding and unparalleled control you’d gain if you could train a Qwen-3 LLM from scratch. This isn’t just about running an existing model; it’s about building intelligence from the ground up, one line of code at a time. This article will serve as your comprehensive guide to doing just that, transforming you from an LLM user into an LLM architect.

This deep dive is inspired by the insightful tutorial from Vuk Rosić, which provides the foundational knowledge to truly master LLM development. You can follow along with the original content here: LLM from Scratch Tutorial – Code & Train Qwen 3.

Why Embark on the Journey to Train Qwen-3 LLM From Scratch?

Qwen-3, developed by Alibaba Cloud’s Qwen team, stands out for its advanced reasoning capabilities, robust multilingual support, and innovative hybrid thinking and non-thinking modes. While readily available pre-trained models are powerful, there are compelling reasons to take on the challenge of developing and training an LLM from its core:

  • Unfiltered Machine Learning Mastery: Building an LLM from scratch offers an unrivaled, raw understanding of how gradients flow, how models learn, and how artificial intelligence truly comes to life. You’ll gain a mastery that goes beyond high-level abstractions, empowering you to debug, optimize, and innovate with confidence.
  • Deep Architectural Insight: This journey isn’t just about coding; it’s about dissecting Qwen-3’s sophisticated architecture. You’ll grasp the intuition behind every component, from attention mechanisms to feed-forward networks, and how they collectively enable complex language understanding and generation.
  • Tailored Customization: When you train a Qwen-3 LLM from scratch, you gain complete control over its entire lifecycle. This allows for unparalleled customization, enabling you to fine-tune every aspect of the model to suit specific domains, niche applications, or unique dataset characteristics. Imagine building an LLM perfectly optimized for medical research, legal document analysis, or even creative writing styles unique to your vision.
  • Enhanced Efficiency and Optimization: Understanding the underlying mechanisms allows for profound optimization. By grasping how elements like Grouped Query Attention (GQA) or the Muon Optimizer contribute to efficiency, you can make informed decisions to reduce memory footprint, accelerate training times, and improve inference speeds – crucial for deploying LLMs in resource-constrained environments.
  • Innovation and Contribution: With a fundamental understanding, you’re not just a consumer of AI technology, but a potential innovator. You can experiment with novel architectures, propose new optimization techniques, or contribute to the open-source LLM community, pushing the boundaries of what’s possible in AI.

This tutorial will guide you through the intricate details of Qwen-3’s architecture and implementation, ensuring that by the end, you possess a comprehensive understanding of these advanced models and the skills to train a Qwen-3 LLM from scratch.

Unpacking Qwen-3’s Cutting-Edge Features

Before diving into the code, let’s explore some of the specific innovations that make Qwen-3 unique and highly efficient. Understanding these features is key to effectively configuring and training your model.

Grouped Query Attention (GQA)

One of Qwen-3’s standout features is its implementation of Grouped Query Attention. In traditional Multi-Head Attention, each query head has its own independent key and value heads. GQA, however, introduces an efficiency trick: it shares key and value heads across multiple query heads.

For example, if you have 8 query heads and 4 key/value heads, each key/value head will be “repeated” twice, so two query heads will attend to the same key/value pair. This significantly reduces the size of the KV cache memory, which stores the keys and values for previously processed tokens. A smaller KV cache means less memory consumption, enabling the model to process longer sequences or larger batch sizes, which are critical for effective LLM training and inference.
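
To make the memory saving concrete, here is a rough back-of-the-envelope sketch. The layer count, head dimension, and sequence length echo the example configuration used later in this article, and fp16 storage is an assumption:

# KV cache size = 2 (K and V) * layers * kv_heads * seq_len * head_dim * bytes_per_value
layers, head_dim, seq_len, bytes_fp16 = 6, 96, 512, 2

def kv_cache_bytes(num_kv_heads):
    return 2 * layers * num_kv_heads * seq_len * head_dim * bytes_fp16

full_mha = kv_cache_bytes(8)   # every query head has its own key/value head
gqa = kv_cache_bytes(4)        # 4 key/value heads shared by 8 query heads
print(f"MHA: {full_mha/1e6:.1f} MB, GQA: {gqa/1e6:.1f} MB")  # GQA halves the cache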

The Revolutionary Muon Optimizer

The Muon Optimizer is a powerful innovation that sets Qwen-3 apart, especially when you train a Qwen-3 LLM from scratch. Traditional optimizers like AdamW can sometimes struggle with stability or convergence speed, particularly when dealing with large learning rates or high-dimensional weight matrices.

Muon addresses these challenges by orthonormalizing update matrices. What does this mean in simpler terms? Imagine your neural network weights as matrices that stretch, rotate, and reflect input data. Often, “stretching” (amplifying certain dimensions) occurs not because it’s beneficial for reducing loss, but because some input numbers are arbitrarily large, leading to disproportionately large weight updates.

The Muon Optimizer aims to prevent this unwanted stretching. It transforms weight updates to primarily rotate vectors rather than stretch them. This is achieved through a mathematical process involving the Newton-Schulz iteration, which approximates a function that orthogonalizes any given matrix. An orthogonal matrix is one that preserves the magnitude of vectors, only rotating them.

Practical Benefits of Muon:

  • Faster Convergence: Muon can often converge faster, allowing models to learn effectively with less training data or fewer steps.
  • Higher Learning Rates: Its inherent stability allows for the use of higher learning rates, further accelerating the training process.
  • Improved Stability: By preventing arbitrary weight amplification, Muon contributes to more stable training, reducing the likelihood of vanishing or exploding gradients.

While the mathematics can be complex, the core idea is to normalize the updates so that changes to the weights are driven primarily by the learning objective, not by quirks in input data scaling. For a deeper dive into the Muon Optimizer, explore the detailed explanations on the developer’s channel.
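
As a conceptual illustration, here is a minimal Newton-Schulz-style orthogonalization sketch. The polynomial coefficients follow the widely circulated Muon reference implementation; note that the Muon class built later in this article uses a different closed-form approximation, so treat this purely as an aid to intuition:

import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix with a Newton-Schulz-style iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public Muon reference code
    x = g / (g.norm() + 1e-7)           # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

update = torch.randn(768, 3072)
ortho = newton_schulz_orthogonalize(update)
# Singular values are now roughly of unit scale: the update rotates rather than stretches
print(torch.linalg.svdvals(ortho)[:5])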

Rotary Positional Embeddings (RoPE)

Unlike traditional positional embeddings that might add a fixed vector to token embeddings, Qwen-3 leverages Rotary Positional Embeddings (RoPE). RoPE encodes positional information by applying a rotation to the query and key vectors of each token. The degree of rotation depends on the token’s position in the sequence.

This method allows the model to inherently understand the relative distance between tokens. For example, it can learn that the 100th token is further from the 10th token but closer to the 95th token, based on the specific rotation applied to their query and key vectors. RoPE is known for its ability to generalize well to longer sequences than seen during training and is a key component of how Qwen-3 manages sequence context.
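
A tiny numeric sketch (using a single 2-dimensional pair purely for illustration) shows the key property: the score between a rotated query and key depends only on their relative offset, not their absolute positions:

import math
import torch

def rotate(vec, pos, theta=0.1):
    """Rotate a 2D vector by pos * theta, the core operation RoPE applies to each frequency pair."""
    rot = torch.tensor([[math.cos(pos * theta), -math.sin(pos * theta)],
                        [math.sin(pos * theta),  math.cos(pos * theta)]])
    return rot @ vec

q = torch.tensor([1.0, 0.0])
k = torch.tensor([0.5, 0.5])

# Same relative offset (5 positions apart) gives the same score at positions 5/10 and 100/105
print(rotate(q, 10) @ rotate(k, 5), rotate(q, 105) @ rotate(k, 100))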

SwiGLU Activation Function

Within the Feed-Forward Networks of Qwen-3, the SwiGLU activation function is employed. SwiGLU is a variant of the Gated Linear Unit (GLU) and the Swish activation function. It works by having two parallel linear projections, one of which is passed through a SiLU (Sigmoid Linear Unit) activation, acting as a “gate.” The output of this gate then multiplies the output of the other linear projection.

This gating mechanism provides an additional level of control, allowing the network to dynamically activate or deactivate certain neurons in the hidden layer based on the input. It effectively acts as a “brightness control” for the information flowing through the network, suppressing less relevant features and amplifying more important ones, leading to more expressive and powerful representations.
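
In code, the gating reduces to a one-line expression. This is a standalone sketch with random projection matrices; the full FeedForward module with learned weights appears later in this article:

import torch
import torch.nn.functional as F

x = torch.randn(2, 768)           # (batch, model_embedding_dim)
w_gate = torch.randn(768, 3072)   # gate projection
w_up = torch.randn(768, 3072)     # up projection

# SwiGLU: SiLU(x @ W_gate) acts as a per-feature gate on x @ W_up
hidden = F.silu(x @ w_gate) * (x @ w_up)
print(hidden.shape)               # torch.Size([2, 3072])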

RMSNorm (Root Mean Square Normalization)

Qwen-3 utilizes RMSNorm instead of the more common LayerNorm. RMSNorm simplifies the normalization process by only scaling the activations by their root mean square, without recentering. It’s computationally more efficient than LayerNorm and can provide similar or even better performance in many scenarios. RMSNorm ensures stability during training by preventing activations from becoming too large or too small, which can lead to unstable gradients.
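
As a quick reference, RMSNorm fits in a couple of lines; this mirrors the RMSNorm module implemented later in this article, with the learnable scale passed in explicitly:

import torch

def rms_norm(x, weight, eps=1e-6):
    # Scale by the inverse root-mean-square over the last dimension; no mean subtraction
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

x = torch.randn(2, 768)
print(rms_norm(x, torch.ones(768)).shape)  # torch.Size([2, 768])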

The Journey Begins: Setting Up Your Environment

To successfully train a Qwen-3 LLM from scratch, you’ll need a suitable environment and a grasp of foundational concepts.

Prerequisites for Your Training Endeavor

Before you begin coding, ensure you have:

  • Basic Understanding of Attention Mechanisms: If terms like “queries,” “keys,” and “values” are new, it’s highly recommended to first familiarize yourself with the basics of self-attention. Resources like the developer’s Llama 4 tutorial on self-attention can provide the necessary foundation.
  • Familiarity with Tokenizers: Understanding how text is converted into numerical tokens is crucial.
  • Python and PyTorch Knowledge: The implementation relies heavily on PyTorch.
  • A GPU Environment: Training LLMs is computationally intensive. Google Colab’s free GPU (typically a T4) is sufficient for initial testing and small models, but for serious training, a more powerful GPU is highly recommended.

Environment Setup in Google Colab

  1. Select GPU Runtime: In Google Colab, go to Runtime > Change runtime type and select GPU as the hardware accelerator.
  2. Initial Imports: Begin with essential libraries. The full list will be comprehensive, but here are some key ones:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import numpy as np
    import random
    import math # For the cosine learning-rate schedule
    from datasets import load_dataset
    from transformers import AutoTokenizer
    from torch.optim import AdamW, Optimizer # AdamW for non-2D parameters; Optimizer is the base class for Muon
    from torch.utils.data import Dataset, DataLoader
    import os
    import tqdm # For progress bars
    

    Note: When prompted by Hugging Face for a token during dataset loading, you can usually press Cancel as many datasets are public.

  3. Ensure Reproducibility with Random Seeds: To compare different training runs effectively, it’s vital to control randomness. Set seeds for torch, numpy, and random:

    torch.manual_seed(1337)
    np.random.seed(1337)
    random.seed(1337)
    # Further CUDA seed for GPU operations (optional, as some GPU randomness is unavoidable)
    torch.cuda.manual_seed_all(1337)
    

Configuring Your Qwen-3 Model

The config dictionary is the heart of your model’s architecture, defining its size, capacity, and various parameters. When you train a Qwen-3 LLM from scratch, these settings dictate the computational demands and potential performance.

config = {
    'model_embedding_dim': 768,       # Inner dimension of the model (token embedding vector dimension)
    'num_heads': 8,                   # Number of heads in the self-attention mechanism
    'num_decoder_layers': 6,          # Number of decoder layers (stack of attention and feed-forward layers)
    'feed_forward_dim': 3072,         # Inner dimension of the feed-forward network (typically 4x model_embedding_dim)
    'batch_size': 24,                 # Batch size for training (adjust based on GPU memory)
    'max_steps': 2000,                # Maximum training steps (start small, increase significantly for real training)
    'num_key_value_heads': 4,         # Number of key and value heads for Grouped Query Attention (e.g., num_heads / 2)
    'sliding_window': 10000,          # Sliding window attention length (set large to effectively disable for short sequences)
    'attention_bias': False,          # Whether to use bias in QKV linear layers (often False in transformers)
    'rms_norm_eps': 1e-6,             # Epsilon value for RMSNorm to prevent division by zero
    'gradient_accumulation_steps': 4, # Simulate larger batch sizes if GPU memory is limited
    'muon_learning_rate': 3e-4,       # Learning rate for the Muon optimizer (can be higher than AdamW)
    'max_sequence_length': 512,       # Maximum sequence length for input tokens (recommend powers of 2)
    'documents_downloaded': 1000,     # Number of documents to download for initial data
    'weight_decay': 0.1,              # L2 regularization to prevent overfitting
    'dropout': 0.1,                   # Dropout rate for regularization
    'gradient_clipping': 1.0,         # Clip gradients to prevent exploding gradients
    'eval_interval': 100,             # How often to evaluate the model during training
    'vocab_size': 30522               # Placeholder - will be updated by tokenizer.vocab_size
}

Practical Hyperparameter Tuning Tips:

  • max_sequence_length and batch_size: These are the most impactful parameters for training performance and memory usage. Try to increase them as much as your GPU memory allows. Powers of two are often recommended for performance.
  • max_steps: For initial testing, 2000 steps might yield some legible output, but for a genuinely capable model, you’ll need tens of thousands, if not hundreds of thousands, of steps, which can take hours or days on consumer GPUs.
  • gradient_accumulation_steps: If your GPU cannot handle a large batch_size directly, use gradient_accumulation_steps. This processes multiple mini-batches, accumulates their gradients, and then updates weights once, effectively simulating a larger batch size without demanding more VRAM for a single forward pass. While not as fast as a true large batch size, it’s an excellent workaround.
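
The accumulation pattern, stripped to its essentials, looks like this. It is a schematic sketch with a tiny stand-in model and dummy batches; the full training loop later in this article follows the same structure:

import torch
import torch.nn as nn

model = nn.Linear(8, 1)                                    # tiny stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(8)]  # dummy mini-batches
accumulation_steps = 4

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    loss = nn.functional.mse_loss(model(inputs), labels) / accumulation_steps  # scale the loss
    loss.backward()                                        # gradients accumulate across mini-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                   # one update per 4 mini-batches
        optimizer.zero_grad()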

Building the Qwen-3 Architecture: Step-by-Step Components

Now, let’s construct the fundamental building blocks of your Qwen-3 LLM.

1. Grouped Query Attention (GQA) Implementation

The repeat_kv_heads function is crucial for GQA, ensuring that key and value heads are duplicated to match the query heads.

def repeat_kv_heads(x: torch.Tensor, repetitions: int) -> torch.Tensor:
    """
    Repeats the key/value heads along dimension 2 (num_key_value_heads)
    to match the number of query heads. This is essential for Grouped Query Attention.
    """
    batch_size, num_key_value_heads, sequence_length, head_dim = x.shape
    # Expand along a new dimension, then reshape to interleave the repetitions
    x = x[:, :, None, :, :].expand(batch_size, num_key_value_heads, repetitions, sequence_length, head_dim)
    x = x.reshape(batch_size, num_key_value_heads * repetitions, sequence_length, head_dim)
    return x

This function takes the (batch_size, num_key_value_heads, sequence_length, head_dim) tensor and expands it to (batch_size, num_heads, sequence_length, head_dim), enabling the query heads to interact with the shared key/value heads.
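
A quick shape check (using the 8 query heads and 4 key/value heads from the example configuration) makes the expansion visible:

kv = torch.randn(2, 4, 512, 96)               # (batch, num_key_value_heads, seq_len, head_dim)
expanded = repeat_kv_heads(kv, repetitions=2)
print(expanded.shape)                          # torch.Size([2, 8, 512, 96]) -- now matches the 8 query heads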

2. The Powerful Muon Optimizer Class

Implementing the Muon optimizer directly involves handling its unique update rule, which focuses on orthogonalizing the weight updates. This is a crucial step when you aim to train a Qwen-3 LLM from scratch with optimal efficiency.

class Muon(Optimizer):
    def __init__(self, params, lr=1e-3, momentum=0, nesterov=False):
        if lr <= 0: raise ValueError(f"Invalid learning rate: {lr}")
        if momentum < 0.0: raise ValueError(f"Invalid momentum value: {momentum}")
        defaults = dict(lr=lr, momentum=momentum, nesterov=nesterov)
        if nesterov and (momentum <= 0 or momentum > 1):
            raise ValueError("Nesterov momentum requires a momentum value between 0 and 1")
        super(Muon, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(Muon, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('nesterov', False)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            for p in group['params']:
                if p.grad is None: continue

                d_p = p.grad
                if d_p.is_sparse:
                    raise RuntimeError('Muon optimizer does not support sparse gradients')

                state = self.state[p]

                if group['momentum'] != 0:
                    if 'momentum_buffer' not in state:
                        buf = state['momentum_buffer'] = torch.clone(d_p).detach()
                    else:
                        buf = state['momentum_buffer']
                        buf.mul_(group['momentum']).add_(d_p)
                    if group['nesterov']:
                        d_p = d_p.add(buf, alpha=group['momentum'])
                    else:
                        d_p = buf

                update = d_p

                # Muon's core logic for 2D matrices: orthogonalization
                if len(p.shape) == 2: # Applies to weights of linear layers
                    I = torch.eye(p.shape[1], device=p.device)
                    XTX = torch.matmul(update.T, update) # (update.T * update)
                    A = XTX + I # Add identity matrix for numerical stability
                    XTX_n = torch.linalg.solve(A, XTX) # Solve A*X = B, effectively X=inv(A)*B
                    update = update - torch.matmul(update, XTX_n) # Subtract the non-orthogonal component

                p.add_(update, alpha=-lr) # Apply the orthogonalized update

        return loss

The key insight here is the if len(p.shape) == 2: block, which performs the orthogonalization. This ensures that the weight updates primarily rotate, rather than stretch, the vectors, leading to the benefits discussed earlier.
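
A toy sanity check of the optimizer on a single 2D weight matrix would look like this (it relies on the imports and the Muon class defined above):

w = nn.Parameter(torch.randn(64, 64))
opt = Muon([w], lr=1e-3, momentum=0.9)
loss = (w ** 2).sum()
loss.backward()
opt.step()       # applies the orthogonalized update to the 2D weight
opt.zero_grad()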

3. Data Loading and Preparation Pipeline

Efficient data handling is paramount. We’ll use Hugging Face datasets and transformers for tokenization.

# Initialize tokenizer (e.g., "bert-base-uncased" is a good general choice)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config['vocab_size'] = tokenizer.vocab_size # Update vocab_size in config

# Load a suitable dataset (TinyStories is great for small LLMs)
num_documents_to_download = config['documents_downloaded']
dataset = load_dataset("roneneldan/TinyStories", split='train')

text_data = ""
# Concatenate a portion of each document to keep diversity
for i in range(min(num_documents_to_download, len(dataset))):
    text_data += dataset[i]['text'][:3000] # Take first 3000 chars to maintain diversity

# Tokenize the entire corpus
print(f"Tokenizing {len(text_data)} characters...")
tokens = tokenizer.encode(text_data)
print(f"Total tokens: {len(tokens)}")

class TextDataset(Dataset):
    def __init__(self, tokens, sequence_length):
        self.tokens = tokens
        self.sequence_length = sequence_length

    def __len__(self):
        # Ensure we have enough tokens for one sequence and its shifted label
        return len(self.tokens) - self.sequence_length - 1

    def __getitem__(self, idx):
        # Input sequence (X)
        inputs = torch.tensor(self.tokens[idx : idx + self.sequence_length], dtype=torch.long)
        # Target sequence (Y), shifted by one token for next-token prediction
        labels = torch.tensor(self.tokens[idx+1 : idx + self.sequence_length+1], dtype=torch.long)
        return inputs, labels

# Create the dataset instance
full_dataset = TextDataset(tokens, config['max_sequence_length'])

This prepares your text data by tokenizing it and structuring it into input-output pairs suitable for next-token prediction training.
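
Pulling a single example out of the dataset makes the one-token shift easy to verify:

inputs, labels = full_dataset[0]
print(inputs.shape, labels.shape)        # both torch.Size([512])
print(tokenizer.decode(inputs[:10]))     # start of the sequence
print(tokenizer.decode(labels[:10]))     # the same text shifted one token to the right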

4. Rotary Positional Embeddings (RoPE) Implementation

RoPE is applied to query and key vectors to inject positional information through rotation.

def rotate_half(x):
    """Rotates half the hidden dims from x."""
    x1 = x[..., :x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=-1)

# @torch.jit.script # Optional: for potential performance boost with JIT compilation
def apply_rotary_pos_emb(q, k, cos, sin):
    # Apply cosine and sine rotations
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

class RotaryPositionalEmbedding(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

        # Build rotary matrix beforehand for efficiency
        self._set_cos_sin_cache(max_position_embeddings)

    def _set_cos_sin_cache(self, seq_len):
        self.max_seq_len_cached = seq_len
        t = torch.arange(seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but used in Qwen-3:
        # Instead of `torch.cat((-freqs, freqs))`, use `freqs` twice for cos and sin
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().view(1, 1, seq_len, self.dim))
        self.register_buffer("sin_cached", emb.sin().view(1, 1, seq_len, self.dim))

    def forward(self, seq_len: int, position_ids: torch.Tensor = None):
        # If sequence length exceeds cached, rebuild cache
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len)
        return (
            self.cos_cached[:, :, :seq_len, ...],
            self.sin_cached[:, :, :seq_len, ...],
        )

# Initialize RoPE (head_dim = model_embedding_dim / num_heads = 768 / 8 = 96)
head_dim = config['model_embedding_dim'] // config['num_heads']
rope = RotaryPositionalEmbedding(head_dim, max_position_embeddings=config['max_sequence_length'])

This class pre-calculates the cosine and sine components for rotations, which are then applied to the query and key vectors in the self-attention layer.
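
Calling the module returns the cached tables for a given sequence length, broadcastable over the batch and head dimensions:

cos, sin = rope(config['max_sequence_length'])
print(cos.shape, sin.shape)   # torch.Size([1, 1, 512, 96]) each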

5. Self-Attention Mechanism with GQA

The SelfAttention class incorporates the Grouped Query Attention and applies RoPE.

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x: torch.Tensor):
        # Calculate RMS of the input, then scale x by its inverse
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x: torch.Tensor):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight

class SelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config['num_heads']
        self.head_dim = config['model_embedding_dim'] // config['num_heads']
        self.num_key_value_heads = config['num_key_value_heads']
        self.repeats = self.num_heads // self.num_key_value_heads # How many times to repeat KV heads

        # QKV projections
        self.wq = nn.Linear(config['model_embedding_dim'], config['num_heads'] * self.head_dim, bias=config['attention_bias'])
        self.wk = nn.Linear(config['model_embedding_dim'], self.num_key_value_heads * self.head_dim, bias=config['attention_bias'])
        self.wv = nn.Linear(config['model_embedding_dim'], self.num_key_value_heads * self.head_dim, bias=config['attention_bias'])
        self.wo = nn.Linear(config['num_heads'] * self.head_dim, config['model_embedding_dim'], bias=config['attention_bias'])

    def forward(self, x, rope_cos_sin):
        batch_size, sequence_length, _ = x.shape
        # Project input into Q, K, V
        q = self.wq(x).view(batch_size, sequence_length, self.num_heads, self.head_dim)
        k = self.wk(x).view(batch_size, sequence_length, self.num_key_value_heads, self.head_dim)
        v = self.wv(x).view(batch_size, sequence_length, self.num_key_value_heads, self.head_dim)

        # Transpose to (batch, num_heads, seq_len, head_dim) -- the layout expected by
        # RoPE broadcasting, repeat_kv_heads, and scaled_dot_product_attention
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # Apply RoPE to Q and K using the pre-computed cos/sin values
        cos_cached, sin_cached = rope_cos_sin
        q, k = apply_rotary_pos_emb(q, k, cos_cached, sin_cached)

        # Repeat key/value heads for GQA so they match the number of query heads
        k = repeat_kv_heads(k, self.repeats)
        v = repeat_kv_heads(v, self.repeats)

        # Perform causal attention (PyTorch's optimized function)
        attn_output = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # Transpose back and combine heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, sequence_length, -1)

        # Output projection
        output = self.wo(attn_output)
        return output

6. Feed-Forward Network with SwiGLU

The FFN processes the attention output, using SwiGLU for enhanced non-linearity and gating.

class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Up-projection and gate projection for SwiGLU
        self.up_proj = nn.Linear(config['model_embedding_dim'], config['feed_forward_dim'], bias=False)
        self.gate_proj = nn.Linear(config['model_embedding_dim'], config['feed_forward_dim'], bias=False)
        self.down_proj = nn.Linear(config['feed_forward_dim'], config['model_embedding_dim'], bias=False)

    def forward(self, x):
        # SwiGLU activation: (SiLU(gate_proj(x))) * up_proj(x)
        activated_gate = F.silu(self.gate_proj(x))
        up_projected = self.up_proj(x)
        gated_output = activated_gate * up_projected
        # Down-projection
        output = self.down_proj(gated_output)
        return output

This structure creates a more dynamic and controlled information flow compared to simpler activation functions.

7. Assembling the Transformer Block

A TransformerBlock combines the self-attention and feed-forward layers with residual connections and pre-normalization.

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = SelfAttention(config)
        self.feed_forward = FeedForward(config)
        self.rms_norm1 = RMSNorm(config['model_embedding_dim'], eps=config['rms_norm_eps'])
        self.rms_norm2 = RMSNorm(config['model_embedding_dim'], eps=config['rms_norm_eps'])

    def forward(self, x, rope_cos_sin):
        # Pre-normalization (RMSNorm before attention) + Residual connection
        residual_attn = x
        x = self.rms_norm1(x)
        x = self.attention(x, rope_cos_sin)
        x = residual_attn + x # Add attention output to original input (residual)

        # Pre-normalization (RMSNorm before feed-forward) + Residual connection
        residual_ffn = x
        x = self.rms_norm2(x)
        x = self.feed_forward(x)
        x = residual_ffn + x # Add FFN output to previous output (residual)

        return x

This layered structure is fundamental to how Transformer models process sequences, allowing them to learn complex patterns.

8. The Complete Language Model

The LanguageModel class orchestrates all components: token embeddings, a stack of transformer blocks, and a final linear layer to predict the next token.

class LanguageModel(nn.Module):
    def __init__(self, config, vocab_size):
        super().__init__()
        self.config = config
        self.vocab_size = vocab_size
        # Token embeddings: converts token IDs to dense vectors
        self.embedding = nn.Embedding(vocab_size, config['model_embedding_dim'])
        # Stack of Transformer blocks
        self.layers = nn.ModuleList([TransformerBlock(config) for _ in range(config['num_decoder_layers'])])
        self.rms_norm = RMSNorm(config['model_embedding_dim'], eps=config['rms_norm_eps']) # Final normalization
        # Output head: projects model output back to vocabulary size for logits
        self.linear = nn.Linear(config['model_embedding_dim'], vocab_size, bias=False)

        # Weight Tying: Share weights between input embeddings and output layer
        # This saves memory and often improves performance
        self.linear.weight = self.embedding.weight

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, RMSNorm):
            torch.nn.init.ones_(module.weight)

    def forward(self, x, rope_cos_sin):
        # Convert token IDs to embeddings
        x = self.embedding(x)
        # Scale embeddings by sqrt(model_dimension), as in the original Transformer
        x = x * (self.config['model_embedding_dim'] ** 0.5)

        # Pass through all transformer layers
        for layer in self.layers:
            x = layer(x, rope_cos_sin)

        # Final normalization
        x = self.rms_norm(x)
        # Project to vocabulary size to get logits
        logits = self.linear(x)

        return logits

The weight tying (self.linear.weight = self.embedding.weight) is a clever optimization. It means the matrix used to project token embeddings at the input is the same matrix used to project the final transformer output back into vocabulary logits, saving memory and promoting symmetry.
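
You can confirm the tying (and the parameter saving) with a quick check; the throwaway model below is instantiated only for inspection:

check_model = LanguageModel(config, config['vocab_size'])
print(check_model.linear.weight is check_model.embedding.weight)   # True -- a single shared matrix
total_params = sum(p.numel() for p in check_model.parameters())    # the tied matrix is counted once
print(f"Total parameters: {total_params/1e6:.1f}M")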

Training Your Custom Qwen-3 LLM

With the architecture defined, it’s time to bring your Qwen-3 LLM to life through training.

Optimizer Setup: Muon and AdamW Combined

Since Muon is primarily designed for 2D matrices (like the weights of nn.Linear layers), other parameters (like embeddings or RMSNorm weights) are best handled by a standard optimizer like AdamW. Note that the model and device must be set up before this split, since we iterate over model.named_parameters().

# Instantiate the model (and move the RoPE buffers to the same device) before splitting parameters
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = LanguageModel(config, config['vocab_size']).to(device)
rope = rope.to(device)  # the cached cos/sin tensors must live on the same device as the model

# Separate parameters for Muon and AdamW optimizers
muon_params = []
adamw_params = []

# Iterate through model parameters and assign them
for n, p in model.named_parameters():
    if p.requires_grad:
        # Muon handles 2D matrices that are not embeddings or RMSNorm weights
        if p.ndim == 2 and 'embedding' not in n and 'rms_norm' not in n and 'linear' not in n:
            muon_params.append(p)
        else:
            adamw_params.append(p) # AdamW handles embeddings, RMSNorm weights, and 1D biases

print(f"Parameters for Muon: {len(muon_params)}")
print(f"Parameters for AdamW: {len(adamw_params)}")

optimizer_muon = Muon(muon_params, lr=config['muon_learning_rate'])
# AdamW typically needs a much lower learning rate than Muon
optimizer_adamw = AdamW(adamw_params, lr=config['muon_learning_rate']/10, weight_decay=config['weight_decay'])

This careful separation ensures each part of your model benefits from the most suitable optimization strategy.

The Training Loop

The training loop iterates through your data, performing forward and backward passes, and updating the model’s weights.

# (model, device, and rope were already set up above, just before the optimizer split)

# Prepare data loaders
train_size = int(0.9 * len(full_dataset))
val_size = len(full_dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(full_dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=config['batch_size'], shuffle=False, drop_last=True)

# Learning Rate Scheduler (e.g., Cosine Annealing with Warmup)
# This is a critical component for stable and effective training
warmup_steps = config['max_steps'] // 10 # 10% of total steps for warmup
def lr_lambda(current_step):
    if current_step < warmup_steps:
        return float(current_step) / float(max(1, warmup_steps))
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * (current_step - warmup_steps) / (config['max_steps'] - warmup_steps))))

scheduler_muon = torch.optim.lr_scheduler.LambdaLR(optimizer_muon, lr_lambda)
scheduler_adamw = torch.optim.lr_scheduler.LambdaLR(optimizer_adamw, lr_lambda)

best_val_loss = float('inf')

print(f"Starting training on {device}...")
train_iter = iter(train_loader)  # create the data iterator once, outside the loop
for step in tqdm.tqdm(range(config['max_steps'])):
    # --- Training Phase ---
    model.train()
    try:
        inputs, labels = next(train_iter)          # get the next batch
    except StopIteration:                          # restart the iterator when the epoch ends
        train_iter = iter(train_loader)
        inputs, labels = next(train_iter)
    inputs, labels = inputs.to(device), labels.to(device)

    # Get RoPE cos/sin values for the current sequence length
    rope_cos_sin = rope(config['max_sequence_length'], position_ids=None)

    # Forward pass
    logits = model(inputs, rope_cos_sin)
    loss = F.cross_entropy(logits.view(-1, config['vocab_size']), labels.view(-1))
    loss = loss / config['gradient_accumulation_steps'] # Scale loss for gradient accumulation

    # Backward pass (gradients accumulate across mini-batches)
    loss.backward()

    # Update weights once every gradient_accumulation_steps mini-batches
    if (step + 1) % config['gradient_accumulation_steps'] == 0:
        # Gradient clipping (prevents exploding gradients), applied just before the update
        if config['gradient_clipping'] > 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), config['gradient_clipping'])
        optimizer_muon.step()
        optimizer_adamw.step()
        optimizer_muon.zero_grad()
        optimizer_adamw.zero_grad()
        scheduler_muon.step()
        scheduler_adamw.step()

    # --- Evaluation Phase ---
    if (step + 1) % config['eval_interval'] == 0:
        model.eval()
        val_loss_sum = 0
        val_correct_predictions = 0
        val_total_predictions = 0
        with torch.no_grad():
            for val_inputs, val_labels in val_loader:
                val_inputs, val_labels = val_inputs.to(device), val_labels.to(device)
                val_rope_cos_sin = rope(config['max_sequence_length'], position_ids=None)
                val_logits = model(val_inputs, val_rope_cos_sin)
                val_loss_sum += F.cross_entropy(val_logits.view(-1, config['vocab_size']), val_labels.view(-1)).item()

                # Calculate accuracy
                predicted_tokens = torch.argmax(val_logits, dim=-1)
                val_correct_predictions += (predicted_tokens == val_labels).sum().item()
                val_total_predictions += val_labels.numel()

        avg_val_loss = val_loss_sum / len(val_loader)
        val_accuracy = val_correct_predictions / val_total_predictions
        val_perplexity = torch.exp(torch.tensor(avg_val_loss)).item()

        print(f"\nStep {step+1}/{config['max_steps']} | Train Loss: {loss.item() * config['gradient_accumulation_steps']:.4f} | "
              f"Val Loss: {avg_val_loss:.4f} | Val Accuracy: {val_accuracy:.4f} | Val Perplexity: {val_perplexity:.2f}")

        # Save the best model
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            torch.save(model.state_dict(), 'best_model.pt')
            print("Saved best_model.pt")

    # Save final model state
    if (step + 1) == config['max_steps']:
        torch.save(model.state_dict(), 'final_model.pt')
        print("Training complete. Saved final_model.pt")

This loop ensures your model systematically learns from the data, periodically checks its performance on unseen validation data, and saves the best-performing version. When you train a Qwen-3 LLM from scratch, patience and careful monitoring are key.

Interacting with Your Trained Qwen-3 LLM (Inference)

Once your model is trained, it’s time to see it generate text.

# Load the best or final model
model = LanguageModel(config, config['vocab_size']).to(device)
model.load_state_dict(torch.load('final_model.pt', map_location=device)) # Or 'best_model.pt'
model.eval() # Set model to evaluation mode

# Generation parameters
temperature = 0.7   # Controls randomness (higher = more random)
top_k = 50          # Sample from K most likely tokens
top_p = 0.9         # Sample from smallest set of tokens whose cumulative probability exceeds P
max_new_tokens = 100 # Max tokens to generate

def generate_text(prompt, model, tokenizer, max_new_tokens, temperature, top_k, top_p, device):
    encoded_prompt = tokenizer.encode(prompt, return_tensors='pt').to(device)
    generated_tokens = encoded_prompt

    for _ in range(max_new_tokens):
        # Get RoPE cos/sin values for current sequence length
        current_sequence_length = generated_tokens.shape[1]
        rope_cos_sin = rope(current_sequence_length, position_ids=None)

        # Get logits for the last token
        logits = model(generated_tokens, rope_cos_sin)[:, -1, :] # Only need last token's logits

        # Apply temperature
        if temperature == 0:
            probs = F.softmax(logits, dim=-1)
        else:
            probs = F.softmax(logits / temperature, dim=-1)

        # Top-K sampling
        if top_k is not None:
            v, i = torch.topk(probs, min(top_k, probs.size(-1)))
            probs[probs < v[:, [-1]]] = 0 # Zero out probabilities below the Kth-largest value
            probs = probs / probs.sum() # Re-normalize

        # Top-P (nucleus) sampling
        if top_p is not None and top_p < 1.0:
            sorted_probs, sorted_indices = torch.sort(probs, descending=True)
            cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

            # Remove tokens with cumulative probability above top_p threshold
            sorted_indices_to_remove = cumulative_probs > top_p
            # Shift the indices to the right to keep at least one token
            sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
            sorted_indices_to_remove[..., 0] = 0

            indices_to_remove = sorted_indices[sorted_indices_to_remove]
            probs[:, indices_to_remove] = 0
            probs = probs / probs.sum() # Re-normalize

        # Sample the next token
        next_token_id = torch.multinomial(probs, num_samples=1)
        generated_tokens = torch.cat((generated_tokens, next_token_id), dim=1)

        # Stop if end of sequence token is generated (adjust for your tokenizer's EOS)
        # if next_token_id.item() == tokenizer.eos_token_id: # Example
        #     break

    return tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

# Example Usage
print("\n--- Generating Text ---")
input_prompt = "The future of artificial intelligence"
generated_text = generate_text(input_prompt, model, tokenizer, max_new_tokens, temperature, top_k, top_p, device)
print(generated_text)

print("\n--- Interactive Chat (type 'quit' to exit) ---")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    response = generate_text(user_input, model, tokenizer, max_new_tokens=50, temperature=0.8, top_k=50, top_p=0.9, device=device)
    print(f"Qwen-3: {response}")

Even with a small GPU and limited training time (e.g., 15 minutes as mentioned in the source), the model can produce surprisingly coherent, albeit simple, text. The true power emerges with more substantial computation and longer training periods.

Conclusion: Your Path to LLM Mastery

Learning to train a Qwen-3 LLM from scratch is more than just a coding exercise; it’s an immersive journey into the heart of modern artificial intelligence. You’ve explored cutting-edge concepts like Grouped Query Attention and the innovative Muon Optimizer, gaining an unparalleled understanding of how these powerful models function.

By following this step-by-step tutorial, you’ve not only built a Qwen-3 model from its foundational components but also acquired the practical skills to configure, optimize, and interact with it. This raw, unfiltered machine learning mastery will empower you to tackle complex AI challenges, customize models for specific applications, and contribute meaningfully to the rapidly evolving field of large language models.

Now, take what you’ve learned. Experiment with different hyperparameters, expand your dataset, and explore even longer training runs. The world of LLM development is vast and full of possibilities, and you now have the tools to navigate it with confidence and innovation. Dive deeper, build more, and continue to push the boundaries of AI!

