05 — Transformer Architecture

Estimated time: 8 hours
Goal: Understand the Transformer in detail, the architecture underlying all modern LLMs.

REQUIRED: Watch Karpathy's "Let's build GPT from scratch" (2 hours). This file is a summary plus practice.

Why This Material Matters

The Transformer is arguably the most important architecture in the history of AI. Since the paper "Attention is All You Need" (2017), almost every breakthrough (GPT, BERT, T5, Llama, Claude, Gemini, the text encoder in Stable Diffusion, DALL-E, Whisper) has been built on top of it. If you want to be a real GenAI engineer and not just an API user, this is the one topic you cannot skip.

Picture the AI world in 2017: translation models used LSTMs with attention, read a sequence word by word, trained slowly, and struggled with long sentences. Then Google published a paper with a radical idea: "Drop the RNN. Drop convolution. Attention alone is enough." The result? Models that train faster, are more accurate, and scale to sizes that were previously out of reach. ChatGPT, which took the world by storm in 2022, descends from GPT-3, which scaled this architecture up to 175B parameters. A deep understanding of the Transformer is a deep understanding of the foundation of the current AI era.

Three key insights you will master: (1) Self-attention, the mechanism by which each word "shines a spotlight" on the other words relevant to it, (2) Multi-head attention, many parallel perspectives that capture different aspects, and (3) Decoder-only architecture, the pattern behind GPT, Claude, and Llama that dominates modern LLMs.

Mental Map: Transformer Big Picture

How to Read the Diagram:

Left = raw input, right = output token
Middle = N transformer blocks (typically 12-96 in modern LLMs)
Token embedding and positional encoding are summed at the start
LayerNorm + LM head at the end produce the next token

Step-by-Step Walkthrough:

Input tokens come from the tokenizer
Token Embedding maps each token ID to a vector of size d_model
Positional Encoding supplies position information
The two are summed to form the transformer input
Pass through N Transformer Blocks in sequence (each block: MHA + FFN + residual + norm)
Final LayerNorm
LM Head projects to the vocab size and predicts a distribution over the next token
Sample / argmax → output token

Everyday Analogy: A Transformer is a multi-stage language-processing factory. Each block is a workstation that filters and deepens understanding. The deeper the block, the more complex the patterns it captures. The LM head at the end is the "sales department" that produces output ready to ship.

Static Mermaid diagram as a fallback:

flowchart TD
    Tokens["📝 Input tokens<br/>'I like AI'"] --> TE["🔤 Token Embedding"]
    Tokens --> PE["📍 Positional Encoding"]
    TE --> Sum["➕"]
    PE --> Sum
    Sum --> Block1["🔷 Transformer Block 1<br/>(MHA + FFN)"]
    Block1 --> Block2["🔷 Transformer Block 2"]
    Block2 --> BlockN["🔷 Transformer Block N"]
    BlockN --> Norm["⚖️ LayerNorm"]
    Norm --> Head["🎯 LM Head<br/>(predict next token)"]
    Head --> Out["📝 Output token"]
    style Tokens fill:#dbeafe
    style PE fill:#fef3c7
    style Block1 fill:#fed7aa
    style Block2 fill:#fed7aa
    style BlockN fill:#fed7aa
    style Out fill:#d1fae5

Introduction

The Transformer (2017) is the most important architecture in modern AI. ChatGPT, Claude, Gemini, Llama: all are Transformers. Without understanding it, you are only a "user", not an "engineer".

Part 1: Big Picture

Before the Transformer

RNN/LSTM:

Processes words one at a time (sequential)
Bottleneck: a fixed-size hidden state
Hard to parallelize

The Transformer Insight

"Attention is all you need"

Processes all words at once with attention
Highly parallelizable on GPUs
Better at long-range dependencies

Part 2: Self-Attention (the Core of the Transformer)

Concept

Each word "asks" every other word: "how relevant are you to me?"

Result: a representation of each word that is shaped by its context.

Self-Attention Analogy: Picture a group discussion. Each person has a question (Query): "I need information about X". Each person also has a label or name card (Key) that shows their expertise, and content (Value), the knowledge they are ready to share. When person A asks, they look at every name card (Key) in the room, pick the ones that best match the question, and absorb the content (Value) from those people: the better the card matches, the larger the weight on that content. The end result: person A gets a combined answer from everyone, weighted by relevance. That is self-attention.

Another Analogy, the Spotlight: Picture a theater with many actors on stage. When you focus on the word "it" in the sentence "The cat sat on the mat because it was tired", you point the brightest light at "cat" (the subject), a fairly bright light at "tired" (the predicate), and dim lights at "mat", "the", and "on". These light intensities are the attention weights (the softmax of the attention scores).

Visualizing Self-Attention Q/K/V

Static Mermaid diagram as a fallback:

flowchart LR
    I["📝 Input<br/>(seq, d)"] --> Q["🔑 Query<br/>X·W_Q"]
    I --> K["🗝️ Key<br/>X·W_K"]
    I --> V["💎 Value<br/>X·W_V"]
    Q --> M["⚙️ Q · K^T"]
    K --> M
    M --> Scale["÷ √d"]
    Scale --> SM["📊 Softmax"]
    SM --> A["⚡ Attention<br/>weights"]
    V --> Mul["× V"]
    A --> Mul
    Mul --> O["✅ Output<br/>(seq, d)"]
    style I fill:#dbeafe
    style Q fill:#fef3c7
    style K fill:#fed7aa
    style V fill:#fce7f3
    style O fill:#d1fae5

Mathematical Formulation

For every token, compute 3 vectors:

Query (Q): what am I looking for?
Key (K): what do I offer?
Value (V): what will I share?

Q = X × W_Q
K = X × W_K
V = X × W_V

W_Q, W_K, W_V = learned weight matrices.

Attention Score

Attention(Q, K, V) = softmax(Q × K^T / √d) × V

Here d is the dimension of each key vector (d_k in the code).

Steps:

Q × K^T: similarity matrix between tokens
/ √d: scaling for numerical stability (keeps the softmax from saturating)
softmax: converts each row into a probability distribution
× V: weighted sum of the value vectors
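The four steps above can be traced by hand on a tiny input. The sketch below is a plain-Python illustration, not the module used later in this file: the 3-token, d = 2 matrices are made-up values, and Q, K, V are given directly instead of being computed from X with learned weights.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(K[0])
    out, weights = [], []
    for q in Q:
        # Steps 1-2: similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Step 3: turn the scores into a probability distribution
        w = softmax(scores)
        # Step 4: weighted sum of the value vectors
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
        weights.append(w)
    return out, weights

# Hypothetical toy input: 3 tokens, d = 2 (in a real model Q, K, V come
# from X @ W_Q, X @ W_K, X @ W_V)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, weights = attention(Q, K, V)
for i, w in enumerate(weights):
    print(f"token {i} attends with weights {[round(x, 3) for x in w]}")
```

Each row of weights sums to 1, and a token puts more weight on keys that point in the same direction as its query.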

Visualization

For the sentence "The animal didn't cross the street because it was too tired":

When processing "it":

Attention to "animal": high (the subject being referred to)
Attention to "tired": high (the relevant predicate)
Attention to "the", "because": low

The model learns these relations automatically during training.

Simple Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.d_model = d_model
    
    def forward(self, x):
        # x: (batch, seq_len, d_model)
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        
        # Attention scores
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_model)
        attention = F.softmax(scores, dim=-1)
        
        # Apply
        out = attention @ V
        return out

Part 3: Multi-Head Attention

Single attention = 1 perspective. Multi-head = many perspectives in parallel.

Multi-Head Attention Analogy: Single attention is one pair of eyes looking at one aspect of the relationships. Multi-head is many pairs of eyes focusing on different aspects in parallel: head 1 on who the subject is, head 2 on the adjectives attached to it, head 3 on temporal relations, and so on. Each head has its own W_Q, W_K, W_V, so each learns different patterns. The outputs of all heads are concatenated and then projected back. That is how a Transformer can track many dimensions of context at once.

Visualizing Multi-Head

How to Read the Diagram:

Left = the input split into h parallel heads
Middle = each head has its own W_Q, W_K, W_V and learns different patterns
Right = the outputs are concatenated and projected back to d_model
d_model must be divisible by h (example: 512 = 8 × 64)

Step-by-Step Walkthrough:

After the Q/K/V projections, the (seq, d_model) tensors are split into h heads, each of dimension d_k = d_model / h
Head 1 might learn syntactic relations (subject-verb)
Head 2 might focus on semantics (words with similar meaning)
Head 3 might focus on position (neighboring words)
Head 4 might focus on coreference (which noun "it" refers to)
The outputs of all heads are concatenated back into (seq, d_model)
The W_O linear layer projects the result to mix information across heads
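The split and concat steps are only a change of layout, which a small plain-Python sketch can make concrete. The helper names `split_heads` and `concat_heads` and the 8-element vector are hypothetical, chosen for illustration. The PyTorch code in this section does the same thing on whole tensors with `view` and `transpose`.

```python
def split_heads(vec, num_heads):
    """Split one d_model-sized vector into num_heads chunks of size d_k."""
    d_model = len(vec)
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    return [vec[h * d_k:(h + 1) * d_k] for h in range(num_heads)]

def concat_heads(heads):
    """Concatenate per-head vectors back into one d_model-sized vector."""
    return [v for head in heads for v in head]

token_vec = list(range(8))          # hypothetical projected vector, d_model = 8
heads = split_heads(token_vec, 4)   # 4 heads, d_k = 2
print(heads)                        # [[0, 1], [2, 3], [4, 5], [6, 7]]
restored = concat_heads(heads)
print(restored == token_vec)        # True
```

No information is lost by splitting. What differs between heads is that each one runs its own attention over its own slice.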

Everyday Analogy: Multi-head is a panel of experts with different specialties. When reading a sentence, the grammar expert watches structure, the meaning expert watches the words, and the coreference expert watches references. Each expert gives an opinion, and the opinions are combined into a complete understanding.

Static Mermaid diagram as a fallback:

flowchart TD
    I["📝 Input<br/>(seq, d_model)"] --> Split["✂️ Split into h heads"]
    Split --> H1["👁️ Head 1<br/>syntax"]
    Split --> H2["👁️ Head 2<br/>semantics"]
    Split --> H3["👁️ Head 3<br/>position"]
    Split --> H4["👁️ Head 4<br/>coreference"]
    H1 --> Concat["🔗 Concat"]
    H2 --> Concat
    H3 --> Concat
    H4 --> Concat
    Concat --> WO["⚙️ W_O linear"]
    WO --> O["✅ Output<br/>(seq, d_model)"]
    style I fill:#dbeafe
    style H1 fill:#fef3c7
    style H2 fill:#fed7aa
    style H3 fill:#fce7f3
    style H4 fill:#e9d5ff
    style O fill:#d1fae5

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, causal=False):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.causal = causal  # set True for decoder-style (GPT) attention

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, D = x.shape

        # Project, then reshape to (B, num_heads, T, d_k)
        Q = self.W_q(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention per head: scores are (B, num_heads, T, T)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if self.causal:
            # Hide future positions with a lower-triangular mask
            mask = torch.tril(torch.ones(T, T, device=x.device))
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attention = F.softmax(scores, dim=-1)
        out = attention @ V

        # Concatenate heads back to (B, T, D), then mix them with W_o
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.W_o(out)

Insight: in practice, some heads end up focusing on syntactic relations, others on semantic ones, and so on. The model learns this specialization on its own.

Part 4: Positional Encoding

Self-attention by itself ignores word order: without position information, "dog bites man" and "man bites dog" yield the same set of token representations, just in a different order. The fix: add position information to the embeddings.

Positional Encoding Analogy: Picture tossing 10 colored marbles onto a table: nothing tells you which one came first. Now stick a numbered label (1, 2, 3, ...) on each marble, and you know which is first, second, and so on. Positional encoding is that "numbered sticker" on each token. Without it, "I hit Budi" and "Budi hits me" look the same to a Transformer. With it, the model knows each word's position and can learn order-sensitive patterns.
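The order-blindness can be checked directly. The sketch below is a simplified illustration in plain Python: it assumes identity projections (Q = K = V = X), uses made-up 2-d embeddings, and adds no positional information. Under those assumptions each word gets the same output vector no matter where it sits in the sentence.

```python
import math

def self_attention(X):
    """Self-attention with identity projections (Q = K = V = X), no positions."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        out.append([sum(wj * x[c] for wj, x in zip(w, X)) for c in range(d)])
    return out

# Hypothetical 2-d embeddings for three words
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

s1 = ["dog", "bites", "man"]
s2 = ["man", "bites", "dog"]
out1 = dict(zip(s1, self_attention([emb[w] for w in s1])))
out2 = dict(zip(s2, self_attention([emb[w] for w in s2])))

for w in s1:
    same = all(math.isclose(a, b) for a, b in zip(out1[w], out2[w]))
    print(w, "has the same representation in both orders:", same)
```

Because "dog" ends up with an identical vector in both sentences, nothing downstream could tell who bit whom. That is the gap positional encoding fills.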

Visualization: How Positional Encoding Is Added

How to Read the Diagram:

Top left = word embeddings, bottom left = positional encodings
Middle = element-wise add (same dimensions, so just +)
Right = the result, ready to go into attention
Without positional encoding, attention is blind to order

Step-by-Step Walkthrough:

Token embeddings have shape (seq, d_model): a meaning vector for each word
Positional encodings have the same shape, but their values depend on the position (0, 1, 2, ...)
Element-wise add: final[i] = token_emb[i] + pos_emb[i]
The model learns to separate the "meaning" and "position" components of the sum
The result is fed into the attention block

Everyday Analogy: Positional encoding is the numbered sticker on a colored marble. Without stickers, the marbles carry no order. With stickers, you know which is first, second, and so on. A Transformer needs this because attention alone is blind to order.

Static Mermaid diagram as a fallback:

flowchart LR
    Tok["🔤 Token Embedding<br/>('I', 'like', 'AI')"] --> Add["➕ Element-wise add"]
    Pos["📍 Positional Encoding<br/>(pos 0, 1, 2)"] --> Add
    Add --> Out["✅ Final Input<br/>(ready for attention)"]
    style Tok fill:#dbeafe
    style Pos fill:#fef3c7
    style Out fill:#d1fae5

Sinusoidal (Vanilla)

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
    
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe
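To see what values this formula produces, here is a pure-Python sketch of the same computation on a tiny case. It assumes an even d_model, and the function name `sinusoidal_pe` is chosen here for illustration.

```python
import math

def sinusoidal_pe(seq_len, d_model):
    """Pure-Python version of the sinusoidal formula (d_model assumed even)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            freq = math.exp(-math.log(10000.0) * i / d_model)  # 1 / 10000^(i/d_model)
            pe[pos][i] = math.sin(pos * freq)       # even index: sine
            pe[pos][i + 1] = math.cos(pos * freq)   # odd index: cosine
    return pe

pe = sinusoidal_pe(seq_len=4, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

Position 0 is always [0, 1, 0, 1, ...] because sin(0) = 0 and cos(0) = 1. Early dimensions oscillate quickly with position and later dimensions slowly, which gives every position a distinct pattern.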

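Learned Positional Embedding (GPT-2 / BERT Style)

# Inside a module: one learned vector per position, up to max_seq_len
self.pos_embedding = nn.Embedding(max_seq_len, d_model)
positions = torch.arange(seq_len)
pos_embed = self.pos_embedding(positions)
x = token_embed + pos_embed  # named x to avoid shadowing the built-in input()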

RoPE (Rotary Position Embedding)

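Used in Llama and GPT-NeoX. RoPE encodes position by rotating the query and key vectors, so attention scores depend on the relative distance between tokens. This tends to behave better on longer sequences than absolute position embeddings.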
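The relative-position property of RoPE can be shown in a few lines. This is a minimal sketch of one common formulation (rotating adjacent pairs of dimensions), not the exact code of any particular model. The vectors `q` and `k` and the helper names are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 0.5, -1.0]   # hypothetical query vector
k = [0.3, -0.7, 1.5, 0.2]   # hypothetical key vector

# Same relative offset (2), different absolute positions
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 10), rope(k, 8))
print(score_a, score_b)
```

The two scores agree up to floating-point error, because rotating both vectors by the same extra angle leaves their dot product unchanged. Only the offset between query and key positions matters.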

Part 5: Transformer Block

Transformer Block Analogy: Each block is one round of a panel discussion. First, all participants exchange information through attention (who is relevant to whom). Second, each participant reflects privately on what they heard through the feed-forward network (FFN), a way of "digesting" the new information. The residual connection (skip connection) means a participant's original notes are never lost: new information is only added on top. LayerNorm keeps the discussion orderly (no participant becomes overly dominant). The block is repeated N times (from 6 in the original Transformer to 96 in GPT-3), and understanding is refined with each round.

Visualizing the Transformer Block (Pre-Norm Style, Modern)

How to Read the Diagram:

Top to bottom = forward direction within one block
Dashed edges from the left = residual / skip connections
Pre-norm = LayerNorm before each sub-layer (modern, GPT/Llama)
The block is repeated N times with no change in structure

Step-by-Step Walkthrough:

Input x comes in
LayerNorm (pre-norm) before attention
Multi-Head Attention runs on the normalized result
Residual +: x + attn(norm(x)), the original input plus the attention output
Second LayerNorm before the FFN
FFN (Linear → GELU → Linear, hidden size typically 4× d_model)
Residual +: the FFN output is added to r1 (the result of the first residual)
The output is ready for the next block

Everyday Analogy: A Transformer block is a round of panel discussion plus reflection. Attention = participants exchanging information. FFN = each participant digesting it alone. Residual = the original notes are kept and new information is only added. LayerNorm = rules that keep the discussion orderly.

Static Mermaid diagram as a fallback:

flowchart TD
    In["📥 Input x"] --> N1["⚖️ LayerNorm"]
    N1 --> MHA["👁️ Multi-Head Attention"]
    MHA --> R1["➕ Residual<br/>x + attn(norm(x))"]
    In --> R1
    R1 --> N2["⚖️ LayerNorm"]
    N2 --> FF["🧠 FFN<br/>(Linear→GELU→Linear)"]
    FF --> R2["➕ Residual<br/>r1 + ffn(norm(r1))"]
    R1 --> R2
    R2 --> Out["📤 Output"]
    style In fill:#dbeafe
    style MHA fill:#fed7aa
    style FF fill:#fce7f3
    style Out fill:#d1fae5

class TransformerBlock(nn.Module):
    # d_ff=2048 is 4 x 512, as in the original paper; a common rule is d_ff = 4 * d_model
    def __init__(self, d_model, num_heads, d_ff=2048):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)

        # Position-wise feed-forward network: expand, apply non-linearity, project back
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm attention sub-layer with residual connection
        x = x + self.attn(self.norm1(x))
        # Pre-norm feed-forward sub-layer with residual connection
        x = x + self.ff(self.norm2(x))
        return x

Components:

Multi-head attention
Feed-forward (2 linear layers with an activation in between)
Layer Norm (before or after each sub-layer: pre-norm vs post-norm)
Residual connection (skip connection)

Part 6: Full Transformer Architectures

Encoder-only (BERT)

Uses bidirectional attention. Good for:

Classification
NER
Question answering (extractive)

Decoder-only (GPT, Llama, Claude)

Causal attention (each token can only see the tokens before it). Good for:

Text generation
Conversation
Code generation

Encoder-Decoder (T5, BART)

The encoder processes the input, the decoder generates the output. Good for:

Translation
Summarization

Modern LLMs are predominantly decoder-only: GPT, Claude, Llama, and Mistral all are.

Visualization: 3 Transformer Variants

How to Read the Diagram:

3 rows = 3 different architectures
Encoder (cyan) = bidirectional, sees left and right context
Decoder (amber) = causal, sees only the past
Encoder-Decoder = a combination of the two, connected by cross-attention

Step-by-Step Walkthrough:

Encoder-only (BERT): input → bidirectional encoder → [CLS] / per-token output. Good for classification and NER.
Decoder-only (GPT/Claude): prompt → causal decoder → autoregressive tokens. Dominant in modern LLMs.
Encoder-Decoder (T5): source → encoder → decoder with cross-attention → target. Good for translation.

Everyday Analogy: BERT is a reader who reads the whole paragraph first and then answers questions. GPT is a writer who writes word by word and cannot peek at the future. T5 is a translator who reads the source first, then writes the target while frequently glancing back at the source.

Static Mermaid diagram as a fallback:

flowchart LR
    subgraph EO["🔷 Encoder-only (BERT)"]
        EOIn["📝 Input"] --> EOEnc["Encoder<br/>(bidirectional)"]
        EOEnc --> EOOut["🎯 [CLS] / per-token<br/>(class/NER)"]
    end
    subgraph DO["🔶 Decoder-only (GPT/Claude)"]
        DOIn["📝 Prompt"] --> DODec["Decoder<br/>(causal)"]
        DODec --> DOOut["📝 Generated text<br/>(autoregressive)"]
    end
    subgraph ED["🔵 Encoder-Decoder (T5)"]
        EDIn["📝 Source"] --> EDEnc["Encoder"]
        EDEnc --> EDDec["Decoder<br/>(cross-attention)"]
        EDDec --> EDOut["📝 Target"]
    end
    style EO fill:#dbeafe
    style DO fill:#fef3c7
    style ED fill:#fce7f3

Comparison Table: GPT vs BERT vs T5

Aspect	GPT (decoder-only)	BERT (encoder-only)	T5 (enc-dec)
Attention	Causal (left-to-right)	Bidirectional	Encoder bi, decoder causal
Training objective	Next token prediction	Masked LM + NSP	Span corruption + seq2seq
Best for	Generation, chat, code	Classification, NER, QA	Translation, summarization
Output style	Autoregressive token	Per-token / [CLS]	Sequence to sequence
Modern usage	⭐⭐⭐ Dominant (LLMs)	⭐⭐ NLP classification	⭐ Task-specific
Examples	GPT-4, Claude, Llama	BERT, RoBERTa, IndoBERT	T5, FLAN-T5, BART

Part 7: Causal (Masked) Attention

In a decoder, attention is masked so that each token can only see itself and the tokens before it:

Causal Mask Analogy: Picture an exam. Each student (token) may only look at the notes of students seated ahead of them and to their left, and may not peek behind or to the right (future tokens). That is why it is called "causal": a cause must precede its effect. Without the mask, the model "cheats" during training by looking at future tokens, and generation then falls apart because that future information is not available.

Visualizing the Causal Mask

flowchart LR
    subgraph M["Mask Matrix (5×5)"]
        M1["1 0 0 0 0<br/>1 1 0 0 0<br/>1 1 1 0 0<br/>1 1 1 1 0<br/>1 1 1 1 1<br/>(lower triangular)"]
    end
    style M fill:#fef3c7

def causal_attention(Q, K, V, mask=None):
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))
    
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    
    attention = F.softmax(scores, dim=-1)
    return attention @ V

# Causal mask: lower triangular
seq_len = 10
mask = torch.tril(torch.ones(seq_len, seq_len))
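The effect of filling masked positions with -inf is easiest to see on a tiny example. The sketch below is a plain-Python illustration with made-up scores (all zeros), so the resulting weights can be checked by hand. The function name `causal_softmax` is chosen here for illustration.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) get -inf, so they contribute exp(-inf) = 0
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores: all zeros, so every visible token gets equal weight
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself (weight 1.0), the second splits its weight over two tokens, and so on. Every entry above the diagonal is exactly 0.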

Part 8: Mini GPT Implementation

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, num_heads=4, num_layers=4, max_len=128):
        super().__init__()
        self.max_len = max_len  # context window; generate() crops the input to this length
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Embedding(max_len, d_model)

        # NOTE: for autoregressive (GPT-style) training, the attention inside each
        # block must apply a causal lower-triangular mask so tokens cannot see the future.
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads)
            for _ in range(num_layers)
        ])

        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) of token IDs, with seq_len <= max_len
        B, T = x.shape

        # Embeddings: token meaning + learned position
        token_emb = self.token_embedding(x)
        pos_emb = self.pos_embedding(torch.arange(T, device=x.device))
        x = token_emb + pos_emb

        # Blocks
        for block in self.blocks:
            x = block(x)

        # Output
        x = self.norm(x)
        logits = self.head(x)    # (B, T, vocab_size)
        return logits

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            # Crop to the last max_len tokens, then predict the next token
            logits = self(idx[:, -self.max_len:])
            logits = logits[:, -1, :]    # logits at the last position
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx

This is built from scratch in Karpathy's "Let's build GPT". Watching it is required.

Part 9: Scaling Laws

Research shows:

A larger model (more parameters) → better
More data → better
More compute → better

The optimal ratio among these three is what scaling laws describe (the "Chinchilla" paper, 2022).

GPT-3: 175B parameters, trained on ~300B tokens. Chinchilla: 70B parameters, 1.4T tokens → outperforms GPT-3.

Insight: scale matters, but efficient (compute-optimal) scaling matters more than raw size.
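The GPT-3 versus Chinchilla comparison is easier to appreciate as arithmetic. The sketch below uses the parameter and token counts quoted above. The compute estimate C ≈ 6 × N × D is a widely used rule of thumb and is an assumption added here, not something stated in this file.

```python
def tokens_per_param(params, tokens):
    """How many training tokens the model saw per parameter."""
    return tokens / params

def approx_train_flops(params, tokens):
    """Rough rule of thumb (assumption): training compute C ≈ 6 * N * D FLOPs."""
    return 6 * params * tokens

models = {
    "GPT-3":      {"params": 175e9, "tokens": 300e9},
    "Chinchilla": {"params": 70e9,  "tokens": 1.4e12},
}

for name, m in models.items():
    ratio = tokens_per_param(m["params"], m["tokens"])
    flops = approx_train_flops(m["params"], m["tokens"])
    print(f"{name}: {ratio:.1f} tokens/param, ~{flops:.2e} training FLOPs")
```

Under this rough estimate the two training runs are within about 2x of each other in compute, yet Chinchilla sees more than 10x as many tokens per parameter. The difference is in how the compute budget is split between model size and data.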

Part 10: Common Mistakes & FAQ

1. Forgetting the Causal Mask in Decoder Training

# ❌ Without a mask, the model "cheats" by looking at the future
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
attn = F.softmax(scores, dim=-1)

# ✅ Apply a lower-triangular mask before the softmax
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
mask = torch.tril(torch.ones(T, T)).to(device)
scores = scores.masked_fill(mask == 0, float("-inf"))
attn = F.softmax(scores, dim=-1)

2. Attention Scores Not Scaled by √d

# ❌ Large variance, softmax saturates
scores = Q @ K.transpose(-2, -1)

# ✅ Scaling matters for stability
scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
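Why √d specifically? If the entries of q and k are roughly independent with unit variance, their dot product has variance close to d, so dividing by √d brings it back to about 1. The sketch below checks this empirically in plain Python. The Gaussian inputs, the seed, and the trial count are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

def dot_variance(d, scale, trials=2000):
    """Empirical variance of (q . k) * scale for q, k with N(0, 1) entries."""
    vals = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d)]
        k = [random.gauss(0, 1) for _ in range(d)]
        vals.append(sum(a * b for a, b in zip(q, k)) * scale)
    mean = sum(vals) / trials
    return sum((v - mean) ** 2 for v in vals) / trials

d = 64
raw_var = dot_variance(d, scale=1.0)
scaled_var = dot_variance(d, scale=1.0 / math.sqrt(d))
print(f"d={d}: variance unscaled ≈ {raw_var:.1f}, scaled ≈ {scaled_var:.2f}")
```

Unscaled scores with a standard deviation near 8 push the softmax toward a near one-hot output with tiny gradients. After scaling, the scores stay in a range where the softmax remains responsive.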

3. Positional Encoding Not Added

# ❌ Without positional info, "I hit you" = "you hit me"
x = self.token_embedding(input_ids)
x = self.transformer_blocks(x)

# ✅
x = self.token_embedding(input_ids) + self.pos_embedding(positions)
x = self.transformer_blocks(x)

4. d_model Not Divisible by num_heads

# ❌ d_model=128, num_heads=5 → does not divide evenly
attn = MultiHeadAttention(d_model=128, num_heads=5)    # 128/5 = 25.6 → AssertionError

# ✅ Choose a num_heads that divides d_model evenly
attn = MultiHeadAttention(d_model=128, num_heads=8)    # 128/8 = 16 ✓

5. Forgetting LayerNorm

# ❌ Without LayerNorm, training a deep Transformer is often unstable and may not converge
def forward(self, x):
    x = x + self.attn(x)
    x = x + self.ffn(x)
    return x

# ✅ Pre-norm (modern, GPT-style)
def forward(self, x):
    x = x + self.attn(self.norm1(x))
    x = x + self.ffn(self.norm2(x))
    return x

6. Memory Explodes on Long Sequences

# Self-attention needs O(n²) memory in the sequence length n
# 4096 tokens → ~16.8M attention scores per head per layer
# 32k tokens → ~1B → OOM

# ✅ Modern mitigations:
# - Flash Attention (never materializes the full n×n matrix)
# - Sliding window attention (Mistral)
# - Linear attention (Performer)
# - Grouped-query attention (Llama): shrinks the KV cache rather than the n×n matrix
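The quadratic growth can be put into bytes with a few lines of arithmetic. This sketch assumes the full score matrix is stored in fp16 (2 bytes per score) for a single head in a single layer. The helper name `attn_matrix_bytes` is hypothetical.

```python
def attn_matrix_bytes(seq_len, bytes_per_score=2):
    """Bytes for one full (seq_len x seq_len) score matrix: one head, one layer."""
    return seq_len * seq_len * bytes_per_score

for n in (4096, 32768):
    scores = n * n
    mib = attn_matrix_bytes(n) / 2**20
    print(f"{n} tokens: {scores:,} scores, {mib:,.0f} MiB in fp16 per head per layer")
```

Going from 4096 to 32768 tokens is an 8x longer sequence but a 64x larger matrix (32 MiB versus 2048 MiB), and that cost is then multiplied by the number of heads, layers, and sequences in the batch.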

7. Generating Without a KV Cache

# ❌ Every step recomputes all attention from scratch: very slow
for step in range(max_tokens):
    logits = model(all_tokens_so_far)        # recompute everything

# ✅ Use a KV cache: keep the K and V from previous steps
# (Hugging Face generate() uses the KV cache by default)
output = model.generate(input_ids, max_new_tokens=100, use_cache=True)
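What the cache saves can be shown with a toy single-head example. This is a conceptual sketch in plain Python, not how any library implements it: `project_kv` is a hypothetical stand-in for the K and V projections (identity here) that counts how often it is called, and the token vectors are made up.

```python
import math

calls = {"kv": 0}

def project_kv(x):
    """Stand-in for the K and V projections (identity here); counts its calls."""
    calls["kv"] += 1
    return x, x

def attend(q, keys, values):
    """Attention output for a single query over the given keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # hypothetical vectors

# Without a cache: every step re-projects K and V for the whole prefix
calls["kv"] = 0
no_cache = []
for t in range(len(tokens)):
    kv = [project_kv(x) for x in tokens[: t + 1]]
    keys = [k for k, _ in kv]
    values = [v for _, v in kv]
    no_cache.append(attend(tokens[t], keys, values))
no_cache_calls = calls["kv"]

# With a cache: project only the newest token and append it
calls["kv"] = 0
k_cache, v_cache, with_cache = [], [], []
for x in tokens:
    k, v = project_kv(x)
    k_cache.append(k)
    v_cache.append(v)
    with_cache.append(attend(x, k_cache, v_cache))
cache_calls = calls["kv"]

print("same outputs:", no_cache == with_cache)
print("K/V projections:", no_cache_calls, "vs", cache_calls)
```

Both loops produce identical outputs, but the uncached loop performs 1 + 2 + 3 + 4 = 10 projections while the cached loop performs 4. The gap grows quadratically with the number of generated tokens.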

Part 11: Sampling Strategies for Generation

Strategy	How It Works	Pros	Cons
Greedy	Pick the highest-probability token	Deterministic, fast	Repetitive, bland
Beam Search	Track the top-k beams in parallel	More coherent	Expensive, still repetitive
Temperature	Sharpen or flatten the distribution	Controls creativity	Does not filter out unlikely tokens by itself
Top-k	Sample from the k most likely tokens	Avoids bad tokens	k must be tuned
Top-p (nucleus)	Sample from the smallest set of tokens with cumulative probability ≥ p	Adaptive; a common default	p must still be tuned

# Example of a common modern combination (Hugging Face generate())
output = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,        # <1 = more deterministic, 1 = unchanged, >1 = more random
    top_p=0.9,              # nucleus sampling
    top_k=50,               # keep only the 50 most likely tokens
    repetition_penalty=1.1, # discourage repetition
)
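To make the table concrete, here is a plain-Python sketch of temperature, top-k, and top-p applied in that order to a made-up 4-token distribution. The function names and logits are hypothetical, and real libraries operate on tensors of logits, so treat this as an illustration of the idea only.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, -1.0]                 # hypothetical 4-token vocabulary
probs = softmax_with_temperature(logits, 0.7)  # temperature < 1 sharpens
probs = top_k_filter(probs, k=3)               # drops the least likely token
probs = top_p_filter(probs, p=0.9)             # keeps the smallest 0.9-mass set

random.seed(0)
next_token = random.choices(range(len(probs)), weights=probs, k=1)[0]
print([round(x, 3) for x in probs], next_token)
```

After the three filters only the two most likely tokens keep non-zero probability, so sampling stays random but can no longer pick from the unlikely tail.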

Check Your Understanding

Can you explain the concept and formula of self-attention?
Do you know why multi-head attention is used?
Do you know why positional encoding matters?
Do you know the difference between encoder-only, decoder-only, and encoder-decoder?
Do you know the difference between causal and bidirectional attention?
Can you read and understand simple transformer code?

Challenge 6.5

Challenge 1: Karpathy GPT (REQUIRED)

Watch "Let's build GPT from scratch" (2 hours). Implement along with it. Push the code to GitHub.

Challenge 2: Self-Attention from Scratch

Implement self-attention by hand (NumPy/PyTorch). Test it on a toy example.

Challenge 3: Visualize Attention

Using a pretrained BERT, visualize the attention weights for a sentence. Use the bertviz library.

Challenge 4: Train Mini GPT

Dataset: Indonesian text (e.g. a Wikipedia subset, or a poetry collection).

Char-level model
Train mini GPT
Generate text

Next: 06-huggingface.md