Neural Networks in Depth

Goal

Layer types, activations, regularization, and training techniques for neural networks.

02: Neural Networks in Depth

Estimated time: 6 hours


Why Does This Material Matter?

In file 01 you built the skeleton of a training loop. Now it is time to fill in the brain: how neurons are assembled into a network, when to use which layer, and how to keep the model from "memorizing" the data (overfitting). Without this understanding you can only run other people's code, without knowing when to add Dropout, when to swap ReLU for GELU, or why your training does not converge.

Think of a neural network as a lasagna: every layer has its own flavor. Linear is the plain pasta, Conv is the patterned tomato sauce, BatchNorm is the cheese that binds the flavors, and Dropout is the air pockets that keep it from getting too dense. A good recipe depends on the right order and composition. This material teaches you to be the "chef" who knows when to add which layer.

Three key skills you will pick up: (1) Layer types, the basic toolbox for different tasks; (2) Regularization, the anti-overfitting techniques that help a model generalize to new data; and (3) Training tricks (GPU, mixed precision, early stopping) that make your life faster.

Mental Map: Anatomy of a Neural Network

How to Read the Diagram:

  • Top to bottom = the forward pass, from input to loss
  • Each layer has a specific role: transform, normalize, add non-linearity, regularize
  • The Linear → BatchNorm → Activation → Dropout pattern is the "classic" block that shows up often
  • The output layer usually has no activation (logits go straight into the loss)

Step-by-Step Walkthrough:

  1. Input (batch, features): raw data comes in
  2. Linear/Conv: linear transformation (weight × input + bias)
  3. BatchNorm: normalizes each feature and stabilizes training
  4. Activation (ReLU/GELU): adds non-linearity; without it, the stacked layers are equivalent to a single linear layer
  5. Dropout: randomly zeroes neurons during training to fight overfitting
  6. Repeat this pattern for the following layers
  7. Output Layer: produces logits suited to the task
  8. Loss: compares predictions against targets

Everyday Analogy: A neural network is a layered lasagna. Linear is the plain pasta (it carries the data). BatchNorm is the cheese that binds the flavors. Activation is the seasoning that gives it character. Dropout is the air pockets that keep it from getting too dense. Every layer has a role, and the order and composition determine the final taste.

Static Mermaid diagram as a fallback:

flowchart TD
    Input["📥 Input<br/>(batch, features)"] --> L1["🔷 Linear / Conv"]
    L1 --> BN["⚖️ BatchNorm<br/>(stabilize)"]
    BN --> Act["⚡ Activation<br/>(ReLU/GELU)"]
    Act --> Drop["🎲 Dropout<br/>(regularize)"]
    Drop --> L2["🔷 Linear / Conv"]
    L2 --> More["..."]
    More --> Out["🎯 Output Layer"]
    Out --> Loss["📊 Loss"]
    style Input fill:#dbeafe
    style BN fill:#fef3c7
    style Act fill:#fed7aa
    style Drop fill:#fce7f3
    style Out fill:#d1fae5
    style Loss fill:#fee2e2

Part 1: Layer Types

Linear (Fully Connected)

nn.Linear(in_features, out_features, bias=True)
# Operation: y = Wx + b

Conv 2D (for images)

nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
# Input: (N, C, H, W)
# Output: (N, out_channels, H', W')
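
The H' and W' in the output shape follow from the kernel size, stride, and padding. A minimal plain-Python sketch of that arithmetic, assuming dilation 1 and the same kernel, stride, and padding on both axes (the helper name conv2d_output_size is made up for illustration and is not a PyTorch function):

```python
def conv2d_output_size(h, w, kernel_size, stride=1, padding=0):
    """Spatial output size of a 2D convolution (dilation assumed to be 1)."""
    # Each axis: floor((size + 2 * padding - kernel_size) / stride) + 1
    h_out = (h + 2 * padding - kernel_size) // stride + 1
    w_out = (w + 2 * padding - kernel_size) // stride + 1
    return h_out, w_out


print(conv2d_output_size(28, 28, kernel_size=3))                       # (26, 26)
print(conv2d_output_size(28, 28, kernel_size=3, padding=1))            # (28, 28)
print(conv2d_output_size(32, 32, kernel_size=3, stride=2, padding=1))  # (16, 16)
```

A 3×3 kernel with padding=1 and stride=1 keeps the spatial size unchanged, which is why that combination is so common in CNNs.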

MaxPool / AvgPool

nn.MaxPool2d(kernel_size=2, stride=2)
nn.AdaptiveAvgPool2d((1, 1))    # output size fixed

BatchNorm

nn.BatchNorm1d(num_features)        # after Linear layers
nn.BatchNorm2d(num_features)        # after Conv layers

Normalizes each feature using the statistics of the current batch, which stabilizes training and allows a higher learning rate.

Dropout

nn.Dropout(p=0.5)               # for Linear layers
nn.Dropout2d(p=0.5)             # for Conv layers (drops whole channels)

Randomly zeroes neurons during training to reduce overfitting.

Embedding

nn.Embedding(num_embeddings, embedding_dim)
# Lookup table, maps token → vector

LayerNorm (for Transformers)

nn.LayerNorm(normalized_shape)

BatchNorm vs LayerNorm Analogy: BatchNorm is "the class average in each subject" (normalize per feature, across the batch). LayerNorm is "one student's average across all of their subjects" (normalize per sample, across features). LayerNorm does not depend on the batch size, which makes it a good fit for Transformers, where sequence lengths vary.
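
To make the two normalization axes concrete, here is a small plain-Python sketch. It assumes a 2D input (batch × features), uses the biased variance, and leaves out the learnable scale/shift and the running statistics that the real PyTorch layers carry; the function names are illustrative only.

```python
import math


def standardize(values, eps=1e-5):
    """Shift to zero mean and scale to unit variance (biased variance, no learnable affine)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [standardize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [standardize(row) for row in batch]


# 3 students (samples) x 2 subjects (features)
scores = [[60.0, 6.0],
          [70.0, 8.0],
          [80.0, 10.0]]

bn = batch_norm(scores)   # every column now has mean ~0
ln = layer_norm(scores)   # every row now has mean ~0
print([round(v, 2) for v in bn[0]])   # [-1.22, -1.22]
print([round(v, 2) for v in ln[0]])   # [1.0, -1.0]
```

Note that layer_norm would give the same result for a batch of one row, while batch_norm needs several rows to compute a meaningful column average.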

Layer Types Cheat Sheet

| Layer | Input shape | Output shape | Use case |
| --- | --- | --- | --- |
| nn.Linear(in, out) | (*, in) | (*, out) | Classification, regression |
| nn.Conv2d(C_in, C_out, k) | (N, C_in, H, W) | (N, C_out, H', W') | Images |
| nn.MaxPool2d(k) | (N, C, H, W) | (N, C, ⌊H/k⌋, ⌊W/k⌋) | Downsampling |
| nn.BatchNorm2d(C) | (N, C, H, W) | same | Stabilizing CNNs |
| nn.LayerNorm(D) | (N, T, D) | same | Transformers |
| nn.Dropout(p) | any | same | Regularization |
| nn.Embedding(V, D) | (*,) int | (*, D) | Token → vector |
| nn.LSTM(I, H, batch_first=True) | (N, T, I) | (N, T, H) | Sequences |

Part 2: Activation Functions

Analogy: An activation function is a tap that controls how much signal flows through to the next neuron. Without activations (just Linear → Linear → Linear), the whole network is equivalent to a single Linear layer and can only learn straight-line patterns. Non-linear activations are what let a network "bend" its decision boundary into complex shapes.
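
The claim that stacked Linear layers collapse into one can be checked by hand. The sketch below uses tiny made-up weights in plain Python (no PyTorch) and compares Linear → Linear against a single merged layer, then shows that inserting ReLU breaks the equivalence.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def linear(W, b, x):
    return [y + bi for y, bi in zip(matvec(W, x), b)]


def relu(v):
    return [max(0.0, vi) for vi in v]


W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, -0.5, 1.0]   # layer 1: 2 → 3
W2, b2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]], [0.0, 1.0]           # layer 2: 3 → 2
x = [2.0, -3.0]

stacked = linear(W2, b2, linear(W1, b1, x))          # Linear → Linear

# The same map as ONE layer: W = W2·W1, b = W2·b1 + b2
W_merged = matmul(W2, W1)
b_merged = linear(W2, b2, b1)
merged = linear(W_merged, b_merged, x)

with_relu = linear(W2, b2, relu(linear(W1, b1, x)))  # Linear → ReLU → Linear

print(stacked, merged)   # [4.5, 7.0] [4.5, 7.0]
print(with_relu)         # [8.0, 3.5]
```

Once the ReLU sits between the two layers, no single weight matrix and bias can reproduce the result for every input, and that is exactly the extra expressive power the activation buys.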

Visual: Activation Shapes

How to Read the Diagram:

  • Left column = activation name, right column = main use case
  • Each row is an independent choice for a different need
  • Modern hidden-layer defaults: ReLU, GELU (Transformers), or SiLU (Llama)
  • Output layer: Sigmoid (binary), Softmax (multi-class)

Step-by-Step Walkthrough:

  1. ReLU = max(0, x): pick this as the default when in doubt
  2. LeakyReLU = leaks a little on the negative side, which prevents dying ReLU in deep networks
  3. GELU = smooth ReLU, standard in BERT, GPT, and modern Transformers
  4. Sigmoid = squashes to (0, 1); use it on the output of binary classification
  5. Softmax = squashes so the outputs sum to 1; use it on the output of multi-class classification

Everyday Analogy: An activation is a tap on a water pipe. ReLU is a tap that only opens for positive pressure. Sigmoid is a tap that stays within a limited range. Softmax is a panel of many taps whose total flow always adds up to full. Pick the tap that suits what is downstream.

Static Mermaid diagram as a fallback:

flowchart LR
    R["⚡ ReLU<br/>max(0,x)<br/>'zero out negatives'"] --> RU["✅ Default<br/>hidden layer"]
    L["⚡ LeakyReLU<br/>small leak<br/>'never fully dead'"] --> LU["✅ Anti dying<br/>ReLU"]
    G["⚡ GELU<br/>smooth ReLU<br/>'smooth transition'"] --> GU["✅ Transformer<br/>(BERT, GPT)"]
    S["⚡ Sigmoid<br/>squash 0-1<br/>'probability'"] --> SU["✅ Binary<br/>output"]
    SM["⚡ Softmax<br/>squash sum=1<br/>'distribution'"] --> SMU["✅ Multi-class<br/>output"]
    style R fill:#fef3c7
    style L fill:#fde68a
    style G fill:#fcd34d
    style S fill:#fbbf24
    style SM fill:#f59e0b

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 3)    # dummy input: batch of 4, 3 features

F.relu(x)            # max(0, x)
F.leaky_relu(x, 0.1) # max(0.1x, x)
F.gelu(x)            # smooth ReLU (Transformers)
torch.sigmoid(x)     # range (0, 1)
torch.tanh(x)        # range (-1, 1)
F.softmax(x, dim=-1) # probability distribution over the last dim

# Or as layers
nn.ReLU()
nn.GELU()
nn.Sigmoid()
nn.Softmax(dim=-1)

| Activation | Range | Use Case |
| --- | --- | --- |
| ReLU | [0, ∞) | Default hidden |
| LeakyReLU | (-∞, ∞) | Avoid dying ReLU |
| GELU | ≈ [-0.17, ∞), smooth ReLU | Transformer |
| Sigmoid | (0, 1) | Binary output |
| Tanh | (-1, 1) | RNN |
| Softmax | (0, 1), sums to 1 | Multi-class output |

Part 3: Regularization

Overfitting Analogy: A student memorizes the practice problems until every answer is 100% right, then bombs the real exam (new data) because the questions are slightly different. Regularization is the set of techniques that push the student to understand concepts instead of memorizing. Dropout = "you may not look at part of your notes while practicing", weight decay = "a fine for using overly complicated formulas", early stopping = "stop practicing once it is enough, do not overdo it".

Quick Diagnosis: Overfit vs Underfit

flowchart LR
    Train["📈 Train Loss<br/>keeps falling"] --> Cek{Val Loss?}
    Cek -->|"also falling"| Good["✅ Healthy<br/>keep going"]
    Cek -->|"rising / flat"| Over["⚠️ Overfit<br/>add regularization"]
    Cek -->|"high from the start"| Under["⚠️ Underfit<br/>model too small<br/>or wrong LR"]
    style Good fill:#d1fae5
    style Over fill:#fee2e2
    style Under fill:#fef3c7

Dropout

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)    # randomly zeroes 50% of activations (training only)
        x = self.fc2(x)
        return x

model = Net()

# Important: model.train() vs model.eval()
model.train()  # dropout active
model.eval()   # dropout off
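
What the train/eval switch changes can be sketched in plain Python. The version below is "inverted" dropout, the variant nn.Dropout uses: survivors are scaled by 1/(1-p) during training so that nothing needs rescaling at eval time. The helper name and the list-based input are simplifications for illustration.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations."""
    if not training or p == 0.0:
        return list(values)           # eval mode: identity, nothing is dropped
    keep_scale = 1.0 / (1.0 - p)      # rescale survivors so the expected value is unchanged
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)                # fixed seed so the sketch is repeatable
activations = [1.0] * 10

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False)

print(train_out)   # each entry is either 0.0 or 2.0
print(eval_out)    # unchanged: ten 1.0 values
```

This is why forgetting model.eval() hurts: predictions keep passing through a random mask and change from call to call.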

Weight Decay (L2)

optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
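
For intuition, L2-style weight decay adds weight_decay × w to each gradient before the update, so weights shrink toward zero unless the data gradient pushes back. Below is a plain-SGD sketch of that arithmetic; Adam adds the same term to the gradient but then rescales it adaptively, so the numbers differ in practice. The function name is illustrative only.

```python
def sgd_step_with_weight_decay(weights, grads, lr=0.1, weight_decay=1e-2):
    """One SGD update where the L2 penalty is folded into the gradient."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


weights = [2.0, -4.0]
zero_grads = [0.0, 0.0]          # pretend the data gives no gradient signal at all

for _ in range(3):
    weights = sgd_step_with_weight_decay(weights, zero_grads)

# With zero data gradient, every step multiplies each weight by (1 - lr * weight_decay)
print(weights)
```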

Early Stopping

best_val_loss = float("inf")
patience = 5
counter = 0

for epoch in range(100):
    # train ...
    
    val_loss = validate(model, val_loader)
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best.pt")
        counter = 0
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping!")
            break
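
The same logic can be wrapped in a small reusable helper so the training loop stays short. This is a sketch, not a PyTorch API: the class name, the min_delta parameter, and the list of validation losses are all made up for illustration, and saving the checkpoint is left to the caller.

```python
class EarlyStopping:
    """Tracks the best validation loss and signals when patience runs out."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta    # smallest drop that counts as an improvement
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.counter = 0          # improvement: reset the patience counter
            return False
        self.counter += 1
        return self.counter >= self.patience


# Made-up validation losses: improve for three epochs, then plateau
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.65, 0.68]
stopper = EarlyStopping(patience=3)

stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)   # 5 0.65
```

Epoch 5 ties the best loss but does not beat it, so it still counts against patience; whether a tie should reset the counter is a design choice.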

Batch Normalization

Already covered above. It also acts as an implicit regularizer, since the per-batch statistics add a little noise to each forward pass.


Part 4: Weight Initialization

# Manual (layer is any module with a weight, e.g. layer = nn.Linear(10, 5))
nn.init.xavier_uniform_(layer.weight)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

# Or in the constructor
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)
        nn.init.xavier_normal_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)

PyTorch's defaults are already good for most cases. Reach for manual init when training does not converge.
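
To see what these initializers aim for, the sketch below computes the scale each scheme draws from, using the standard formulas with gain 1 for Xavier and the ReLU gain (√2) in fan_in mode for Kaiming. The helper names are illustrative, not PyTorch functions.

```python
import math


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Xavier/Glorot uniform: weights drawn from U(-bound, bound)."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def kaiming_normal_std(fan_in):
    """Kaiming/He normal for ReLU (fan_in mode): weights drawn from N(0, std^2)."""
    return math.sqrt(2.0 / fan_in)


# nn.Linear(10, 5) has fan_in=10, fan_out=5
print(round(xavier_uniform_bound(10, 5), 4))   # 0.6325
print(round(kaiming_normal_std(10), 4))        # 0.4472
print(round(kaiming_normal_std(784), 4))       # 0.0505
```

Wider layers get smaller initial weights, which keeps the variance of the activations roughly constant from layer to layer.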


Part 5: GPU Best Practices

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move model
model = model.to(device)

# Move data per batch
for batch_x, batch_y in loader:
    batch_x = batch_x.to(device)
    batch_y = batch_y.to(device)
    # ...

# Check memory usage
torch.cuda.memory_allocated() / 1024**3  # GB

# Clear cache
torch.cuda.empty_cache()

Mixed Precision (Speed Up Training)

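from torch.cuda.amp import autocast, GradScaler  # newer PyTorch also exposes these under torch.amp

scaler = GradScaler()

for batch_x, batch_y in loader:
    batch_x, batch_y = batch_x.to(device), batch_y.to(device)
    optimizer.zero_grad()

    with autocast():                     # forward pass runs in reduced precision where safe
        output = model(batch_x)
        loss = criterion(output, batch_y)

    scaler.scale(loss).backward()        # scale the loss so small gradients do not underflow
    scaler.step(optimizer)               # unscales gradients, then calls optimizer.step()
    scaler.update()                      # adjusts the scale factor for the next step

Can speed up training by roughly 2-3x on modern GPUs; the actual gain depends on the model and the hardware.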


Part 6: Image Classification Example

MNIST classification with a simple network:

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_data = datasets.MNIST("./data", train=False, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, 10),
        )
    
    def forward(self, x):
        return self.fc(self.flatten(x))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MLP().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Train
for epoch in range(5):
    model.train()
    for batch_x, batch_y in train_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        
        optimizer.zero_grad()
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
    
    # Eval
    model.eval()
    correct = 0
    with torch.no_grad():
        for batch_x, batch_y in test_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            output = model(batch_x)
            pred = output.argmax(dim=1)
            correct += (pred == batch_y).sum().item()
    
    accuracy = correct / len(test_data)
    print(f"Epoch {epoch}: Acc {accuracy:.4f}")

Part 7: Common Mistakes & FAQ

1. Forgetting to Toggle model.train() / model.eval()

# ❌ Evaluating in train mode → Dropout is active, results are random
correct = 0
for x, y in test_loader:
    pred = model(x).argmax(1)         # still in the default train mode

# ✅
model.eval()
with torch.no_grad():
    for x, y in test_loader:
        pred = model(x).argmax(1)

Rule of thumb: call model.train() at the start of the training loop and model.eval() at the start of validation/inference.

2. Applying Softmax Yourself Before CrossEntropyLoss

# ❌ DOUBLE SOFTMAX: gradients get squashed and training stalls
logits = model(x)
probs = F.softmax(logits, dim=-1)
loss = nn.CrossEntropyLoss()(probs, labels)    # CE already applies log-softmax internally!

# ✅
logits = model(x)
loss = nn.CrossEntropyLoss()(logits, labels)
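
Why the double softmax hurts can be seen with plain arithmetic. The sketch below mimics the cross-entropy recipe (log-softmax, then negative log-likelihood) in pure Python for a single sample with three classes; the logits are made up, and the functions are stand-ins rather than the PyTorch implementation.

```python
import math


def softmax(xs):
    m = max(xs)                                  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Mimics the usual recipe: log-softmax over the scores, then negative log-likelihood."""
    return -math.log(softmax(scores)[label])


logits = [10.0, 0.0, 0.0]     # a very confident, correct prediction for class 0
label = 0

correct_loss = cross_entropy(logits, label)            # logits passed straight in
double_loss = cross_entropy(softmax(logits), label)    # softmax applied twice

print(f"{correct_loss:.4f}")   # 0.0001
print(f"{double_loss:.4f}")    # about 0.55, even though the prediction is right
```

Because probabilities live in [0, 1], the second softmax can never become confident, so the loss hits a floor well above zero and the gradients stay weak.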

3. Dropout on the Output Layer

# ❌ Dropout on the logits = noise in the predictions
self.out = nn.Sequential(nn.Linear(64, 10), nn.Dropout(0.5))

# ✅ Dropout only on hidden layers
self.hidden = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.5))
self.out = nn.Linear(64, 10)

4. Learning Rate Too Large/Small

| Symptom | Diagnosis | Fix |
| --- | --- | --- |
| Loss = NaN after a few steps | LR too large / exploding gradients | Lower the LR, use gradient clipping |
| Loss flat, not decreasing | LR too small / stuck in a local minimum | Raise the LR, change the optimizer |
| Loss decreases, then oscillates | LR needs a schedule | Use StepLR or CosineAnnealingLR |
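
The fixes in the table boil down to simple arithmetic. The sketch below shows, in plain Python, gradient clipping by global norm and the closed-form shapes of a step and a cosine schedule. In real code you would reach for torch.nn.utils.clip_grad_norm_ and the schedulers in torch.optim.lr_scheduler; the function names and numbers here are illustrative assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Step schedule: multiply the LR by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Cosine schedule: glide from base_lr down to eta_min over t_max epochs."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))   # norm 5 is rescaled to norm 1
print(step_lr(0.1, epoch=25, step_size=10))            # two decays applied so far
print(cosine_lr(0.1, epoch=50, t_max=100))             # halfway point of the cosine
```

Clipping keeps the direction of the gradient and only shortens it, which is why it tames exploding gradients without changing where the update points.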

5. Batch Size 1 + BatchNorm

# ❌ BatchNorm with batch_size=1 → batch statistics are unreliable
#    (BatchNorm1d on a (1, C) input even raises an error in train mode)
loader = DataLoader(ds, batch_size=1)

# ✅ Use a batch size of at least 8, or replace BatchNorm with LayerNorm/GroupNorm
loader = DataLoader(ds, batch_size=32)

6. Forgetting .item() When Logging

# ❌ Memory leak: each stored loss keeps its computation graph alive
losses.append(loss)

# ✅
losses.append(loss.item())

Part 8: Activation Function Comparison Table (When to Use What?)

| Activation | Formula | Pros | Cons | Best For |
| --- | --- | --- | --- | --- |
| ReLU | max(0, x) | Fast, solid default | Dying ReLU (zero gradient for negative inputs) | Hidden layers in CNNs, MLPs |
| LeakyReLU | max(αx, x) | Never fully dead | One extra hyperparameter | Hidden layers, GANs |
| GELU | x · Φ(x) (smooth ReLU) | Smooth, stable training | More expensive | Transformers (BERT, GPT) |
| SiLU/Swish | x · sigmoid(x) | Self-gated, smooth | More expensive | Llama, modern architectures |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | Output centered at 0 | Vanishing gradient at the extremes | RNN/LSTM hidden states |
| Sigmoid | 1/(1+e^-x) | Output in (0, 1) (probability) | Vanishing gradient | Binary output |
| Softmax | e^xi/Σe^xj | Outputs sum to 1 | Not for hidden layers | Multi-class output |
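
The formulas in the table translate almost one-to-one into code. Here is a plain-Python sketch using only the math module; it uses the exact erf-based GELU (x · Φ(x)) and a default LeakyReLU slope of 0.01, both chosen here for illustration.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    return x * sigmoid(x)


def softmax(xs):
    m = max(xs)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), round(gelu(x), 4), round(silu(x), 4),
          round(sigmoid(x), 4), round(math.tanh(x), 4))

print([round(p, 4) for p in softmax([1.0, 2.0, 3.0])])   # [0.09, 0.2447, 0.6652]
```

Comparing the rows for x = -2.0 shows the practical difference: ReLU outputs exactly zero, while LeakyReLU, GELU, and SiLU all let a small negative value through.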

Check Your Understanding

  • Do you know the difference between Linear, Conv2d, and BatchNorm?
  • Do you know the popular activations and when to use each?
  • Can you apply Dropout for regularization?
  • Can you implement early stopping?
  • Can you train on a GPU?

Challenge 6.2

Challenge 1: MNIST MLP

Replicate the code above. Reach >97% accuracy.

Challenge 2: Architecture Experiments

Try these variations:

  • Add layers (3, 4, 5 layers)
  • Change the hidden size (128, 256, 512)
  • Increase/decrease dropout (0, 0.3, 0.5, 0.7)
  • Swap the activation (ReLU, GELU, LeakyReLU)

Plot an accuracy comparison.

Challenge 3: Loss & Accuracy Curves

Track train/val loss and accuracy. Plot them per epoch. Diagnose overfitting.


Next: 03-cnn-rnn.md