02: Neural Networks in Depth
Estimated time: 6 hours
Goal: Layer types, activations, regularization, and training techniques for neural networks.
Why Does This Material Matter?
In file 01 you built the skeleton of a training loop. Now it is time to fill in the brain: how neurons are arranged into a network, which layer to use when, and how to keep the model from "memorizing" the data (overfitting). Without this understanding you can only run other people's code, with no idea when to add Dropout, when to swap ReLU for GELU, or why your training does not converge.
Think of a neural network as a lasagna where each layer has a different flavor: Linear is the plain pasta, Conv the patterned tomato sauce, BatchNorm the cheese that binds the flavors together, Dropout the air pockets that keep it from getting too dense. A good recipe depends on the right order and composition. This material teaches you to be the "chef" who knows which layer to add and when.
Three key skills you will pick up: (1) Layer types, the basic toolkit for different tasks, (2) Regularization, the anti-overfitting techniques that let a model generalize to new data, and (3) Training tricks such as GPU usage, mixed precision, and early stopping, which make your life faster.
Mental Map: Anatomy of a Neural Network
How to Read the Diagram:
- Top to bottom = the forward pass, from input to loss
- Each layer has a specific role: transform, normalize, add non-linearity, regularize
- The pattern Linear → BatchNorm → Activation → Dropout is a "classic" that shows up often
- The output layer usually has no activation (logits go straight into the loss)
Step-by-Step Walkthrough:
- Input (batch, features): raw data comes in
- Linear/Conv: linear transformation (weight × input + bias)
- BatchNorm: normalizes each feature across the batch, stabilizing training
- Activation (ReLU/GELU): adds non-linearity; without it the network collapses to a single linear layer
- Dropout: randomly zeroes activations during training to counter overfitting
- Repeat this pattern for the following layers
- Output Layer: produces logits shaped for the task
- Loss: compares the predictions with the targets
Everyday Analogy: A neural network is a layered lasagna. Linear = plain pasta (carries the data). BatchNorm = the cheese that binds the flavors together. Activation = the seasoning that adds character. Dropout = the air pockets that keep it from getting too dense. Every layer has a role, and the order and composition determine the final flavor.
Static Mermaid diagram as a fallback:
flowchart TD
Input["📥 Input<br/>(batch, features)"] --> L1["🔷 Linear / Conv"]
L1 --> BN["⚖️ BatchNorm<br/>(stabilize)"]
BN --> Act["⚡ Activation<br/>(ReLU/GELU)"]
Act --> Drop["🎲 Dropout<br/>(regularize)"]
Drop --> L2["🔷 Linear / Conv"]
L2 --> More["..."]
More --> Out["🎯 Output Layer"]
Out --> Loss["📊 Loss"]
style Input fill:#dbeafe
style BN fill:#fef3c7
style Act fill:#fed7aa
style Drop fill:#fce7f3
style Out fill:#d1fae5
style Loss fill:#fee2e2
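The Linear → BatchNorm → Activation → Dropout pattern from the diagram could be sketched in PyTorch roughly as follows. The helper name make_block, the layer sizes, and the dropout rate are illustrative assumptions, not something prescribed by this material.

```python
import torch
import torch.nn as nn

def make_block(in_features: int, out_features: int, p_drop: float = 0.3) -> nn.Sequential:
    """One Linear -> BatchNorm -> Activation -> Dropout block (hypothetical helper)."""
    return nn.Sequential(
        nn.Linear(in_features, out_features),
        nn.BatchNorm1d(out_features),
        nn.ReLU(),
        nn.Dropout(p_drop),
    )

model = nn.Sequential(
    make_block(20, 64),
    make_block(64, 32),
    nn.Linear(32, 10),  # output layer: raw logits, no activation
)

x = torch.randn(8, 20)                    # (batch, features)
targets = torch.randint(0, 10, (8,))      # random class labels, for illustration only
logits = model(x)                         # (batch, num_classes)
loss = nn.CrossEntropyLoss()(logits, targets)
print(tuple(logits.shape), loss.item())
```

The output layer stays a bare nn.Linear because nn.CrossEntropyLoss expects logits.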
Part 1: Layer Types
Linear (Fully Connected)
nn.Linear(in_features, out_features, bias=True)
# Operation: y = Wx + b
Conv 2D (for images)
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
# Input: (N, C, H, W)
# Output: (N, out_channels, H', W')
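The H' and W' in the output shape follow from the kernel size, stride, and padding. A minimal sketch of that arithmetic, assuming dilation 1 (the function name conv2d_out_size is made up for this example):

```python
def conv2d_out_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output height or width of a convolution (dilation assumed to be 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv2d_out_size(28, kernel_size=3, padding=1))            # 28 (size preserved)
print(conv2d_out_size(28, kernel_size=5))                       # 24
print(conv2d_out_size(32, kernel_size=3, stride=2, padding=1))  # 16
```

The same formula applies to pooling layers, where the stride usually equals the kernel size.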
MaxPool / AvgPool
nn.MaxPool2d(kernel_size=2, stride=2)
nn.AdaptiveAvgPool2d((1, 1)) # fixed output size, regardless of input size
BatchNorm
nn.BatchNorm1d(num_features) # for linear layers
nn.BatchNorm2d(num_features) # for conv layers
Normalizes each feature across the batch, which stabilizes training and allows a higher learning rate.
Dropout
nn.Dropout(p=0.5) # for linear layers: zeroes individual elements
nn.Dropout2d(p=0.5) # for conv layers: zeroes entire channels
Randomly zeroes activations during training to counter overfitting.
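To see what "randomly zeroes" means in practice, here is a small sketch. It assumes an all-ones input so the effect is easy to read: in train mode the surviving values are scaled by 1 / (1 - p), and in eval mode the layer is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # each element is either 0 or 1 / (1 - p) = 2.0

drop.eval()
eval_out = drop(x)    # identity: nothing is dropped at inference time

print(sorted(train_out.unique().tolist()))  # [0.0, 2.0]
print(torch.equal(eval_out, x))             # True
```

The rescaling keeps the expected activation the same in both modes, which is why no extra correction is needed at inference time.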
Embedding
nn.Embedding(num_embeddings, embedding_dim)
# Lookup table that maps token ids to vectors
LayerNorm (for Transformers)
nn.LayerNorm(normalized_shape)
BatchNorm vs LayerNorm Analogy: BatchNorm = "the class average in each subject" (normalizes per feature, across the batch). LayerNorm = "one student's average across all of their subjects" (normalizes per sample, across features). LayerNorm does not depend on the batch size, which makes it a good fit for Transformers, where sequence lengths vary.
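The grades analogy can be worked through with plain Python. This is only a sketch of the normalization axes: the numbers are made up, and the learnable scale/shift and the running statistics of the real PyTorch layers are left out.

```python
from statistics import mean, pstdev

# Rows = students (samples), columns = subjects (features). Made-up numbers.
grades = [
    [60.0, 80.0, 100.0],
    [70.0, 90.0, 50.0],
    [80.0, 70.0, 90.0],
]

def normalize(values, eps=1e-5):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma**2 + eps) ** 0.5 for v in values]

# BatchNorm-style: normalize each column (subject) across all students.
columns = [normalize(list(col)) for col in zip(*grades)]
batch_normed = [list(row) for row in zip(*columns)]

# LayerNorm-style: normalize each row (student) across their own subjects.
layer_normed = [normalize(row) for row in grades]

print([round(v, 2) for v in batch_normed[0]])
print([round(v, 2) for v in layer_normed[0]])
```

Note that the LayerNorm-style result for one student never looks at the other rows, which is why it behaves the same for any batch size.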
Layer Types Cheat Sheet
| Layer | Input shape | Output shape | Use case |
|---|---|---|---|
| nn.Linear(in, out) | (*, in) | (*, out) | Classification, regression |
| nn.Conv2d(C_in, C_out, k) | (N, C_in, H, W) | (N, C_out, H', W') | Images |
| nn.MaxPool2d(k) | (N, C, H, W) | (N, C, ⌊H/k⌋, ⌊W/k⌋) | Downsampling |
| nn.BatchNorm2d(C) | (N, C, H, W) | same | Stabilizing CNNs |
| nn.LayerNorm(D) | (N, T, D) | same | Transformers |
| nn.Dropout(p) | any | same | Regularization |
| nn.Embedding(V, D) | (*) int | (*, D) | Token → vector |
| nn.LSTM(I, H, batch_first=True) | (N, T, I) | (N, T, H) | Sequences |
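A quick way to double-check the shapes in a cheat sheet like this is to push dummy tensors through each layer. The sizes below are arbitrary choices for illustration. Note that nn.LSTM only accepts (N, T, I) input when it is built with batch_first=True; its default layout is (T, N, I).

```python
import torch
import torch.nn as nn

N, T = 4, 7  # batch size and sequence length (arbitrary)

shapes = {
    "Linear": nn.Linear(16, 8)(torch.randn(N, 16)).shape,
    "Conv2d": nn.Conv2d(3, 6, kernel_size=3)(torch.randn(N, 3, 28, 28)).shape,
    "MaxPool2d": nn.MaxPool2d(2)(torch.randn(N, 6, 26, 26)).shape,
    "LayerNorm": nn.LayerNorm(32)(torch.randn(N, T, 32)).shape,
    "Embedding": nn.Embedding(100, 32)(torch.randint(0, 100, (N, T))).shape,
}

# nn.LSTM returns (output, (h_n, c_n)); with batch_first=True the output is (N, T, H).
lstm_out, _ = nn.LSTM(32, 64, batch_first=True)(torch.randn(N, T, 32))
shapes["LSTM"] = lstm_out.shape

for name, shape in shapes.items():
    print(f"{name:10s} {tuple(shape)}")
```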
Part 2: Activation Functions
Analogy: An activation function is a tap that controls how much signal flows on to the next neuron. Without activations (just Linear → Linear → Linear), the whole network is equivalent to a single Linear layer and can only learn straight-line patterns. Non-linear activations are what let the network "bend" the decision boundary into complex shapes.
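The claim that stacked Linear layers collapse into one can be checked numerically. The sketch below uses arbitrary layer sizes, merges two nn.Linear layers into a single weight and bias, and then shows that inserting a ReLU breaks the equivalence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f1, f2 = nn.Linear(4, 8), nn.Linear(8, 3)
x = torch.randn(5, 4)

with torch.no_grad():
    stacked = f2(f1(x))                    # Linear -> Linear, no activation
    W = f2.weight @ f1.weight              # merged weight, shape (3, 4)
    b = f2.weight @ f1.bias + f2.bias      # merged bias, shape (3,)
    merged = x @ W.T + b                   # one equivalent linear map
    with_relu = f2(torch.relu(f1(x)))      # non-linearity in between

print(torch.allclose(stacked, merged, atol=1e-5))    # True
print(torch.allclose(with_relu, merged, atol=1e-5))
```

The merge works because W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), which is again just a weight matrix times x plus a bias.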
Visual: Activation Shapes
How to Read the Diagram:
- Left column = activation name, right column = main use case
- Each row = an independent choice for a different need
- Modern defaults for hidden layers: ReLU, GELU (Transformers), or SiLU (Llama)
- Output layer: Sigmoid (binary), Softmax (multi-class)
Step-by-Step Walkthrough:
- ReLU = max(0, x): pick this as the default when in doubt
- LeakyReLU = leaks a little on the negative side, counters dying ReLU in deep networks
- GELU = a smooth ReLU, standard in BERT, GPT, and modern Transformers
- Sigmoid = squashes to (0, 1), used at the output for binary classification
- Softmax = squashes so the outputs sum to 1, used at the output for multi-class classification
Everyday Analogy: Activation = a tap on a water pipe. ReLU = a tap that opens only for positive pressure. Sigmoid = a tap that keeps the flow within a limited range. Softmax = a panel of taps whose combined flow always adds up to one full pipe. Pick the tap that fits what is needed downstream.
Static Mermaid diagram as a fallback:
flowchart LR
R["⚡ ReLU<br/>max(0,x)<br/>'zero out negatives'"] --> RU["✅ Default<br/>hidden layer"]
L["⚡ LeakyReLU<br/>small leak<br/>'never fully dead'"] --> LU["✅ Anti dying<br/>ReLU"]
G["⚡ GELU<br/>smooth ReLU<br/>'smooth transition'"] --> GU["✅ Transformer<br/>(BERT, GPT)"]
S["⚡ Sigmoid<br/>squash 0-1<br/>'probability'"] --> SU["✅ Binary<br/>output"]
SM["⚡ Softmax<br/>squash sum=1<br/>'distribution'"] --> SMU["✅ Multi-class<br/>output"]
style R fill:#fef3c7
style L fill:#fde68a
style G fill:#fcd34d
style S fill:#fbbf24
style SM fill:#f59e0b
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.randn(4, 3) # example input
F.relu(x) # max(0, x)
F.leaky_relu(x, 0.1) # max(0.1x, x)
F.gelu(x) # smooth ReLU (Transformers)
torch.sigmoid(x) # range (0, 1)
torch.tanh(x) # range (-1, 1)
F.softmax(x, dim=-1) # probability distribution over the last dimension
# Or as layers
nn.ReLU()
nn.GELU()
nn.Sigmoid()
nn.Softmax(dim=-1)
| Activation | Range | Use Case |
|---|---|---|
| ReLU | [0, ∞) | Default for hidden layers |
| LeakyReLU | (-∞, ∞) | Avoiding dying ReLU |
| GELU | ≈ [-0.17, ∞) | Transformers |
| Sigmoid | (0, 1) | Binary output |
| Tanh | (-1, 1) | RNNs |
| Softmax | (0, 1), sums to 1 | Multi-class output |
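To make the ranges in the table concrete, the activations can be written out with nothing but the math module. This is a scalar sketch for intuition, not a replacement for the PyTorch functions; the default alpha of 0.01 mirrors the default negative slope of F.leaky_relu.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    return x if x > 0 else alpha * x

def gelu(x: float) -> float:
    # Exact form: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), leaky_relu(-2.0, 0.1), sigmoid(0.0))  # 0.0 -0.2 0.5
print(round(gelu(-0.75), 3))                            # close to GELU's minimum
print(round(sum(softmax([1.0, 2.0, 3.0])), 6))          # 1.0
```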
Part 3: Regularization
Overfitting Analogy: A student memorizes the practice problems until every answer is 100% correct, but on the real exam (new data) the score collapses because the questions are slightly different. Regularization is the set of techniques that make the student understand the concepts instead of memorizing. Dropout = "you may not look at part of your notes while practicing", weight decay = "a fine for using overly complicated formulas", early stopping = "stop practicing once it is enough, do not overshoot".
Quick Diagnosis: Overfit vs Underfit
flowchart LR
Train["📈 Train Loss<br/>keeps decreasing"] --> Cek{Val Loss?}
Cek -->|"also decreasing"| Good["✅ Healthy<br/>keep going"]
Cek -->|"rising / plateauing"| Over["⚠️ Overfit<br/>add regularization"]
Cek -->|"high from the start"| Under["⚠️ Underfit<br/>model too small<br/>or wrong LR"]
style Good fill:#d1fae5
style Over fill:#fee2e2
style Under fill:#fef3c7
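The decision logic in the flowchart could be turned into a rough helper like the one below. The function name diagnose and the high_loss threshold are assumptions made for this sketch; what counts as a "high" loss depends entirely on the task, and this simple version only flags a validation loss that has risen from its best value, not a long plateau.

```python
def diagnose(train_losses, val_losses, high_loss):
    """Rough reading of per-epoch loss curves (hypothetical heuristic)."""
    if min(val_losses) > high_loss:
        return "underfit"   # validation loss never got low at all
    if val_losses[-1] > min(val_losses):
        return "overfit"    # validation loss has risen from its best value
    return "healthy"        # validation loss is still at its best

# Made-up curves for illustration.
print(diagnose([1.0, 0.5, 0.2], [1.1, 0.6, 0.4], high_loss=2.0))    # healthy
print(diagnose([1.0, 0.5, 0.2], [1.1, 0.6, 0.9], high_loss=2.0))    # overfit
print(diagnose([2.5, 2.4, 2.3], [2.6, 2.5, 2.45], high_loss=2.0))   # underfit
```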
Dropout
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x) # randomly zeroes 50% of the activations (train mode only)
        x = self.fc2(x)
        return x

# Important: model.train() vs model.eval()
model = Net()
model.train() # dropout active
model.eval() # dropout disabled
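Weight Decay (L2)
# optim is torch.optim; weight_decay adds an L2 penalty on the weights (use optim.AdamW for decoupled weight decay)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)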
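What the weight_decay term does is easiest to see on a single weight with plain SGD. The sketch below is simplified arithmetic, not the optimizer's actual code: the penalty adds weight_decay * w to the gradient, so the weight shrinks toward zero even when the data gradient is zero.

```python
def sgd_step(w: float, grad: float, lr: float, weight_decay: float = 0.0) -> float:
    """One plain SGD update with an L2-style weight decay term added to the gradient."""
    return w - lr * (grad + weight_decay * w)

w = 1.0
for _ in range(3):
    # grad=0.0 isolates the effect of weight decay: each step multiplies w by 0.95.
    w = sgd_step(w, grad=0.0, lr=0.1, weight_decay=0.5)

print(round(w, 6))  # 0.857375
```

With Adam the same term passes through the adaptive scaling, which is why the decoupled variant AdamW is often preferred.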
Early Stopping
best_val_loss = float("inf")
patience = 5 # epochs to wait without improvement
counter = 0
for epoch in range(100):
    # train ...
    val_loss = validate(model, val_loader) # validate() is your own function returning the validation loss
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best.pt") # keep the best checkpoint
        counter = 0
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping!")
            break
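The same logic can be wrapped in a small reusable helper. The class name EarlyStopping, the min_delta parameter, and the validation losses below are assumptions made for this sketch; PyTorch itself does not ship such a class.

```python
class EarlyStopping:
    """Tracks validation loss and signals when to stop (hypothetical helper)."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta      # minimum improvement that counts
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

stopper = EarlyStopping(patience=2)
fake_val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61]  # made-up numbers
stopped_at = None
for epoch, val_loss in enumerate(fake_val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 4 0.6
```

In a real loop you would save the checkpoint whenever the counter resets to 0, then reload that checkpoint after stopping.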
Batch Normalization
Already covered in Part 1. It also acts as an implicit regularizer, because the per-batch statistics add a little noise to every training step.
Part 4: Weight Initialization
# Manual (layer is any module with a .weight, e.g. nn.Linear(10, 5))
nn.init.xavier_uniform_(layer.weight)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
# Or in the constructor
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)
        nn.init.xavier_normal_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)
PyTorch's defaults are good enough for most cases. Reach for manual init when training fails to converge.
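To get a feel for what these initializers actually do, you can compare the empirical standard deviation of the weights with the textbook values. The layer size below is an arbitrary choice, large enough for a stable estimate.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(1024, 1024)
fan_in, fan_out = layer.in_features, layer.out_features

nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
kaiming_std = layer.weight.std().item()   # expected: sqrt(2 / fan_in)

nn.init.xavier_normal_(layer.weight)
xavier_std = layer.weight.std().item()    # expected: sqrt(2 / (fan_in + fan_out))

print(round(kaiming_std, 4), round(math.sqrt(2 / fan_in), 4))
print(round(xavier_std, 4), round(math.sqrt(2 / (fan_in + fan_out)), 4))
```

Kaiming init uses a larger spread because ReLU zeroes roughly half of the activations, and the extra variance compensates for that.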
Part 5: GPU Best Practices
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move the model
model = model.to(device)
# Move the data batch by batch
for batch_x, batch_y in loader:
    batch_x = batch_x.to(device)
    batch_y = batch_y.to(device)
    # ...
# Check memory usage
torch.cuda.memory_allocated() / 1024**3 # GB
# Release cached, unused memory back to the GPU driver
torch.cuda.empty_cache()
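Mixed Precision (Speed Up Training)
# Newer PyTorch releases expose the same API as torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda")
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for batch_x, batch_y in loader:
    batch_x, batch_y = batch_x.to(device), batch_y.to(device)
    optimizer.zero_grad()
    with autocast(): # forward pass runs in reduced precision where it is safe
        output = model(batch_x)
        loss = criterion(output, batch_y)
    scaler.scale(loss).backward() # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
Often gives a 2-3x speedup on GPUs with Tensor Cores; the actual gain depends on the model and the batch size.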
Part 6: Image Classification Example
MNIST classification with a simple network:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_data = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_data = datasets.MNIST("./data", train=False, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, 10),
        )

    def forward(self, x):
        return self.fc(self.flatten(x))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MLP().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Train
for epoch in range(5):
    model.train()
    for batch_x, batch_y in train_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        optimizer.zero_grad()
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()

    # Eval
    model.eval()
    correct = 0
    with torch.no_grad():
        for batch_x, batch_y in test_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            output = model(batch_x)
            pred = output.argmax(dim=1)
            correct += (pred == batch_y).sum().item()
    accuracy = correct / len(test_data)
    print(f"Epoch {epoch}: Acc {accuracy:.4f}")
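To get a sense of how big this MLP is, you can count its parameters by hand: every Linear layer contributes in × out weights plus out biases. The helper name mlp_param_count is made up for this sketch.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 784 -> 256 -> 64 -> 10, the layer sizes of the MLP above.
print(mlp_param_count([784, 256, 64, 10]))  # 218058
```

On a real model the same number comes from sum(p.numel() for p in model.parameters()); ReLU and Dropout add no parameters.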
Part 7: Common Mistakes & FAQ
1. Forgetting to Toggle model.train() / model.eval()
# ❌ Evaluating in train mode → Dropout stays active, results are random
correct = 0
for x, y in test_loader:
    pred = model(x).argmax(1) # still in the default train mode
# ✅
model.eval()
with torch.no_grad():
    for x, y in test_loader:
        pred = model(x).argmax(1)
Rule of thumb: call model.train() at the start of the training loop and model.eval() at the start of validation/inference.
2. Applying Softmax Yourself Before CrossEntropyLoss
# ❌ DOUBLE SOFTMAX: the gradients get squashed and training converges poorly
logits = model(x)
probs = F.softmax(logits, dim=-1)
loss = nn.CrossEntropyLoss()(probs, labels) # CE already includes softmax!
# ✅
logits = model(x)
loss = nn.CrossEntropyLoss()(logits, labels)
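A little arithmetic shows why the double softmax hurts. The sketch below reimplements the per-sample cross-entropy with the math module (an illustration, not PyTorch's actual code) and uses made-up logits for a confident, correct prediction over 3 classes.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Per-sample cross-entropy from logits: -log(softmax(logits)[target])."""
    return -math.log(softmax(logits)[target])

logits = [5.0, 0.0, 0.0]                       # confident and correct for class 0
correct = cross_entropy(logits, 0)             # softmax applied once
double = cross_entropy(softmax(logits), 0)     # softmax applied twice
floor = -math.log(math.e / (math.e + 2))       # best case: probabilities [1, 0, 0]

print(round(correct, 2), round(double, 2), round(floor, 2))  # 0.01 0.56 0.55
```

Because probabilities live in [0, 1], the second softmax can never produce a sharp distribution. With 3 classes the loss cannot drop below about 0.55 no matter how good the model is, so the learning signal stays weak.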
3. Dropout on the Output Layer
# ❌ Dropout on the logits = noise in the predictions
self.out = nn.Sequential(nn.Linear(64, 10), nn.Dropout(0.5))
# ✅ Dropout only on hidden layers
self.hidden = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.5))
self.out = nn.Linear(64, 10)
4. Learning Rate Too Large/Small
| Symptom | Diagnosis | Fix |
|---|---|---|
| Loss = NaN after a few steps | LR too large / exploding gradients | Lower the LR, use gradient clipping |
| Loss flat, not decreasing | LR too small / stuck in a local minimum | Raise the LR, switch optimizer |
| Loss decreases, then oscillates | LR needs a schedule | Use StepLR or CosineAnnealingLR |
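The two schedules named in the table follow simple closed-form curves, sketched here in plain Python to show their shape. The function names and the numbers are illustrative. In real training you would use optim.lr_scheduler.StepLR or optim.lr_scheduler.CosineAnnealingLR and call scheduler.step() once per epoch.

```python
import math

def step_lr(base_lr: float, epoch: int, step_size: int, gamma: float = 0.1) -> float:
    """Multiply the LR by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def cosine_lr(base_lr: float, epoch: int, t_max: int, eta_min: float = 0.0) -> float:
    """Decay the LR from base_lr to eta_min along half a cosine wave."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 10, 25, 50, 100):
    print(epoch, round(step_lr(0.1, epoch, step_size=10, gamma=0.5), 5),
          round(cosine_lr(0.1, epoch, t_max=100), 5))
```

StepLR drops the rate in visible stairs, while the cosine schedule decays smoothly and slows down near the end.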
5. Batch Size 1 + BatchNorm
# ❌ With batch_size=1 the batch statistics are meaningless; BatchNorm1d even raises a ValueError in train mode
loader = DataLoader(ds, batch_size=1)
# ✅ Use a batch size of at least 8, or replace BatchNorm with LayerNorm/GroupNorm
loader = DataLoader(ds, batch_size=32)
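The failure is easy to reproduce on a toy layer. In this sketch (the feature size is arbitrary), BatchNorm1d refuses a single-sample batch in train mode, while eval mode works because it uses the stored running statistics instead.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
single = torch.randn(1, 4)   # one sample, four features

bn.train()
try:
    bn(single)
    raised = False
except ValueError as err:
    raised = True
    print("train mode:", err)

bn.eval()                    # eval mode uses running statistics instead
out = bn(single)
print("eval mode output shape:", tuple(out.shape))
```

The same problem can show up silently at the end of an epoch when the dataset size is not divisible by the batch size; passing drop_last=True to the DataLoader avoids a trailing batch of size 1.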
6. Forgetting .item() When Logging
# ❌ Memory leak: every stored loss keeps its computation graph alive
losses.append(loss)
# ✅ .item() returns a plain Python float
losses.append(loss.item())
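The difference is visible on a toy scalar. In this sketch (the loss formula is made up), the tensor still points at its autograd graph, while .item() hands back a plain float with no graph attached.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
loss = (w * 3.0 - 1.0) ** 2          # toy loss, attached to the autograd graph

print(loss.grad_fn is not None)      # True: the tensor references its graph
value = loss.item()
print(type(value).__name__, value)   # float 25.0
```

If you need to keep a tensor rather than a float, loss.detach() drops the graph reference as well.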
Part 8: Activation Function Comparison Table (When to Use Which?)
| Activation | Formula | Pros | Cons | Best For |
|---|---|---|---|---|
| ReLU | max(0, x) | Fast, solid default | Dying ReLU (zero gradient for negative inputs) | Hidden layers in CNNs, MLPs |
| LeakyReLU | max(αx, x) | Never dies completely | One extra hyperparameter | Hidden layers, GANs |
| GELU | x · Φ(x) | Smooth, stable training | More expensive | Transformers (BERT, GPT) |
| SiLU/Swish | x · sigmoid(x) | Self-gated, smooth | More expensive | Llama, modern architectures |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | Zero-centered output | Vanishing gradient at the extremes | RNN/LSTM hidden states |
| Sigmoid | 1/(1+e^-x) | Output in (0, 1) (probability) | Vanishing gradient | Binary output |
| Softmax | e^xi/Σe^xj | Outputs sum to 1 | Not for hidden layers | Multi-class output |
Check Your Understanding
- Do you know the difference between Linear, Conv2d, and BatchNorm?
- Do you know the popular activations and when to use each?
- Can you apply Dropout for regularization?
- Can you implement early stopping?
- Can you train on a GPU?
Challenge 6.2
Challenge 1: MNIST MLP
Replicate the code above. Reach >97% accuracy.
Challenge 2: Architecture Experiments
Try these variations:
- Add layers (3, 4, 5 layers)
- Change the hidden size (128, 256, 512)
- Increase/decrease dropout (0, 0.3, 0.5, 0.7)
- Swap the activation (ReLU, GELU, LeakyReLU)
Plot an accuracy comparison.
Challenge 3: Loss & Accuracy Curves
Track train/val loss and accuracy. Plot them per epoch. Diagnose overfitting.
Next: 03-cnn-rnn.md