06 — Hugging Face Ecosystem
Estimasi: 6 jam Tujuan: Pakai pretrained model untuk task nyata. "GitHub-nya AI" — wajib bisa.
Kenapa Materi Ini Penting?
Analogi: Hugging Face = GitHub-nya model AI. Kalau di GitHub kamu bisa download repo code orang dengan
git clone, di Hugging Face kamu bisa download model AI senilai jutaan dolar training cost dengan satu barisfrom_pretrained(). Mau BERT, GPT-2, Llama, Whisper, Stable Diffusion? Semua satu line away. Skip download = skip era modern AI.
Sebelum Hugging Face (2018), pakai pretrained model adalah mimpi buruk: tiap riset publish format sendiri, code-nya berantakan, replikasi sulit. Hugging Face menyatukan semuanya jadi API tunggal yang konsisten — AutoTokenizer.from_pretrained(name) + AutoModel.from_pretrained(name) works for ribuan model. Ini library yang akan kamu pakai setiap hari sebagai GenAI engineer.
Tiga skill kunci yang akan kamu kuasai: (1) pipeline() untuk solusi 1-line di banyak task NLP, (2) AutoTokenizer + AutoModel untuk kontrol penuh, dan (3) Trainer + datasets untuk fine-tune model ke task spesifik kamu (termasuk Bahasa Indonesia).
Peta Mental: Hugging Face Workflow
Cara Membaca Diagram:
- Atas = Hub + load, kiri = input text, kanan = hasil akhir
from_pretrained()download Tokenizer dan Model dari Hub- Tokenizer ubah text → IDs, Model proses IDs → output
- Pola sama untuk ribuan model: BERT, GPT-2, Llama, dll
Walkthrough Step-by-Step:
- HF Hub punya ribuan model siap pakai
AutoTokenizer.from_pretrained(name)+AutoModel.from_pretrained(name)download cache lokal- Tokenizer ubah text "I love AI" jadi token IDs
[101, 1045, 2293, ...] - Model terima IDs, produce logits / generated text
- Decode hasilnya jadi output user-friendly
Analogi Sehari-hari: Hugging Face = GitHub-nya AI. Mau model BERT? from_pretrained() = git clone. Sekarang kamu pakai model jutaan dolar training cost dengan satu baris.
Diagram statis Mermaid sebagai fallback:
flowchart LR
Hub["🤗 HF Hub<br/>(ribuan model)"] --> DL["⬇️ from_pretrained()"]
DL --> Tok["✂️ Tokenizer"]
DL --> Mdl["🧠 Model"]
Text["📝 Input text"] --> Tok
Tok --> IDs["🔢 Token IDs"]
IDs --> Mdl
Mdl --> Out["🎯 Logits/Generation"]
Out --> Result["✅ Final result"]
style Hub fill:#fef3c7
style Tok fill:#dbeafe
style Mdl fill:#fed7aa
style Result fill:#d1fae5
Bagian 1 — Setup
pip install transformers datasets accelerate sentencepiece
from transformers import AutoTokenizer, AutoModel, pipeline
Akun di huggingface.co (gratis). Setup token kalau perlu access gated models.
Bagian 2 — Pipeline (Termudah)
Pipeline = abstraksi tertinggi, 1 line solusi.
from transformers import pipeline
# Sentiment analysis
sentiment = pipeline("sentiment-analysis")
print(sentiment("I love this product!"))
# [{'label': 'POSITIVE', 'score': 0.999}]
# Text classification (custom model)
classifier = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
print(classifier("This is bad"))
# NER
ner = pipeline("ner", grouped_entities=True)
print(ner("Steve Jobs founded Apple in California"))
# Summarization
summ = pipeline("summarization")
print(summ("Long article text here ...", max_length=50))
# Translation
trans = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
print(trans("Saya suka belajar"))
# Text generation
gen = pipeline("text-generation", model="gpt2")
print(gen("Once upon a time", max_length=50))
# Question answering
qa = pipeline("question-answering")
print(qa(question="Where is Apple?", context="Apple is in California"))
# Zero-shot classification
zsc = pipeline("zero-shot-classification")
print(zsc("Saya senang sekali", candidate_labels=["positive", "negative", "neutral"]))
Pipeline = production-ready dalam 1 line. Tapi kurang fleksibel.
Bagian 3 — Tokenizer + Model Manual
Lebih kontrol:
Analogi Manual vs Pipeline:
pipeline()= mesin kopi otomatis — pencet tombol, kopi jadi.Tokenizer + Modelmanual = espresso machine pro — atur grind size, tekanan, suhu sendiri. Untuk eksperimen serius, fine-tuning, atau output kustom, kamu butuh kontrol manual.
Visualisasi Inference Pipeline (Manual)
flowchart LR
T["📝 'I love AI'"] --> Tok["✂️ Tokenizer"]
Tok --> IDs["🔢 input_ids<br/>[101, 1045, 2293, ...]"]
Tok --> Mask["🎭 attention_mask<br/>[1, 1, 1, ...]"]
IDs --> M["🧠 Model"]
Mask --> M
M --> L["📊 Logits<br/>(batch, classes)"]
L --> SM["softmax"]
SM --> P["📈 Probs"]
P --> Lab["🏷️ id2label"]
Lab --> R["✅ 'POSITIVE'"]
style T fill:#dbeafe
style M fill:#fed7aa
style R fill:#d1fae5
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Tokenize
text = "I love AI"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print(inputs)
# {'input_ids': tensor([...]), 'attention_mask': tensor([...])}
# Inference
with torch.no_grad():
outputs = model(**inputs)
logits = outputs.logits
probs = torch.softmax(logits, dim=-1)
pred = probs.argmax(dim=-1)
print(probs)
print(model.config.id2label[pred.item()]) # "POSITIVE"
Bagian 4 — Models Categories
AutoClass (Auto-detect Model Type)
AutoTokenizer— tokenizer apapunAutoModel— base model (no head)AutoModelForSequenceClassification— classificationAutoModelForTokenClassification— NERAutoModelForQuestionAnswering— QAAutoModelForCausalLM— text generation (GPT-style)AutoModelForMaskedLM— fill-mask (BERT-style)AutoModelForSeq2SeqLM— translation, summarization
Popular Pretrained Models
| Model | Size | Best for |
|---|---|---|
bert-base-uncased |
110M | English, classification |
distilbert-base |
67M | Faster BERT |
roberta-base |
125M | Better BERT |
xlm-roberta-base |
125M | Multilingual (incl. Indonesia) |
bert-base-multilingual |
110M | Multilingual |
indobert-base-p1 |
- | Indonesian-specific |
gpt2 |
124M | Generation, English |
t5-small |
60M | Seq2seq |
flan-t5-base |
250M | Better instruction-following T5 |
Bagian 5 — Datasets Library
from datasets import load_dataset
# Built-in
ds = load_dataset("imdb")
print(ds)
# DatasetDict({
# train: Dataset (25000 rows)
# test: Dataset (25000 rows)
# })
print(ds["train"][0])
# {'text': '...', 'label': 1}
# Filter
positive = ds["train"].filter(lambda x: x["label"] == 1)
# Map (transform)
def add_length(example):
example["length"] = len(example["text"])
return example
ds = ds.map(add_length)
Indonesian Datasets
ds = load_dataset("indonlp/indonlu", "smsa") # sentiment
ds = load_dataset("id_clickbait") # clickbait
ds = load_dataset("indonesian_news") # news
Bagian 6 — Fine-Tuning Pretrained Model
Analogi Fine-Tuning: Bayangkan model pretrained = lulusan S1 umum yang sudah baca jutaan buku. Fine-tuning = kursus spesialisasi 3 bulan untuk task spesifik (sentiment Indonesia, klasifikasi medical, dll). Kamu tidak perlu mulai dari TK lagi — modal pengetahuan umum yang sudah ada, tinggal poles ke domain baru. Hasilnya: akurasi tinggi dengan dataset jauh lebih kecil dibanding train from scratch.
Visualisasi Fine-Tuning Workflow
Cara Membaca Diagram:
- Kiri atas = pretrained model (general knowledge), kiri bawah = dataset baru
- Tengah = Trainer yang gabungkan keduanya (few epochs cukup)
- Kanan = fine-tuned model siap untuk task spesifik
- Save / push_to_hub = share ke HF Hub atau simpan lokal
Walkthrough Step-by-Step:
- Pretrained model sudah dilatih di jutaan dokumen (BERT, IndoBERT, dll)
from_pretrained()load model + weights- Custom dataset (SMSA, IMDB, dll) di-tokenize
- Trainer dipakai untuk fine-tune dengan 2-5 epoch (jauh lebih cepat dari training nol)
- Hasilnya fine-tuned model spesialis untuk task kamu
save_pretrainedataupush_to_hubuntuk persist
Analogi Sehari-hari: Pretrained = lulusan S1 umum yang sudah baca jutaan buku. Fine-tuning = kursus spesialisasi 3 bulan untuk task spesifik. Tidak perlu mulai dari TK lagi — modal pengetahuan umum yang sudah ada, tinggal poles.
Diagram statis Mermaid sebagai fallback:
flowchart LR
PT["🎓 Pretrained<br/>BERT/GPT<br/>(general knowledge)"] --> Load["⬇️ load_pretrained"]
Data["📚 Custom dataset<br/>(SMSA, IMDB, ...)"] --> Tok["✂️ Tokenize"]
Tok --> Train["🏋️ Trainer<br/>(few epochs)"]
Load --> Train
Train --> FT["🎯 Fine-tuned model<br/>(specialized)"]
FT --> Save["💾 save / push_to_hub"]
style PT fill:#fef3c7
style Train fill:#dbeafe
style FT fill:#d1fae5
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
# Load
tokenizer = AutoTokenizer.from_pretrained("indobert-base-p1")
model = AutoModelForSequenceClassification.from_pretrained("indobert-base-p1", num_labels=3)
# Dataset
ds = load_dataset("indonlp/indonlu", "smsa")
def tokenize_fn(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)
ds_tokenized = ds.map(tokenize_fn, batched=True)
# Training arguments
args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
warmup_steps=500,
weight_decay=0.01,
logging_dir="./logs",
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
# Trainer
trainer = Trainer(
model=model,
args=args,
train_dataset=ds_tokenized["train"],
eval_dataset=ds_tokenized["validation"],
)
trainer.train()
trainer.evaluate()
# Save
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")
Push to Hub
trainer.push_to_hub("yazid/my-sentiment-model")
# Reload anywhere
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("yazid/my-sentiment-model")
Bagian 7 — Embedding & Sentence Transformers
pip install sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2") # 384-dim
# Atau Indonesia: "indobenchmark/indobert-base-p1"
texts = ["Saya suka kucing", "Anjing itu lucu", "Cuaca hari ini cerah"]
embeddings = model.encode(texts)
print(embeddings.shape) # (3, 384)
Similarity
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(embeddings)
print(sim)
# 3x3 matrix similarity antar texts
Wajib paham: ini adalah fondasi RAG di Fase 7.
Bagian 8 — Generate dengan LLM Open Source
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model kecil
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16)
# Generate
prompt = "What is machine learning?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=100,
temperature=0.7,
do_sample=True,
top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Untuk Model Besar — Pakai Ollama
# Install ollama (lokal)
# Download dari ollama.com
ollama run llama3.2 # 3B params, jalan di laptop biasa
ollama run mistral # 7B
import requests
response = requests.post("http://localhost:11434/api/generate",
json={"model": "llama3.2", "prompt": "Hello", "stream": False})
print(response.json()["response"])
Bagian 9 — Best Practices
Saat Pakai Pretrained
- Cek license — beberapa restricted (Llama, dll)
- Cek model card — paham capabilities dan limitations
- Test dulu di sample kecil
- Pakai model size yang cukup — gak perlu LLama-70B kalau task simple
Saat Fine-Tune
- Mulai dari model kecil — distilBERT, T5-small
- Freeze sebagian layer kalau data sedikit
- Track loss dengan TensorBoard / W&B
- Save checkpoints sering
Production
- Quantization untuk speed up — int8, int4
- ONNX/TorchScript export untuk deployment
- Batch inference di production
Bagian 9 — Common Mistakes & FAQ
1. model.eval() Tidak Dipanggil Saat Inference
# ❌ Hasil bisa berbeda tiap call (Dropout aktif)
model = AutoModelForSequenceClassification.from_pretrained(name)
output = model(**inputs)
# ✅
model.eval()
with torch.no_grad():
output = model(**inputs)
Saat
from_pretrained(), model belum otomatis di mode eval. Kamu wajib panggil sendiri.
2. Lupa Pindah Model & Input ke GPU
# ❌ Model di GPU, input di CPU → device mismatch
model = model.to("cuda")
output = model(**inputs) # error
# ✅
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
3. Truncation Tidak Aktif → Sequence Terlalu Panjang
# ❌ Long text > max position embeddings (512 di BERT) → error
inputs = tokenizer(very_long_text, return_tensors="pt")
# ✅
inputs = tokenizer(very_long_text, truncation=True, max_length=512,
padding=True, return_tensors="pt")
4. Pakai Wrong AutoModel Class
# ❌ AutoModel = base model tanpa head klasifikasi
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased") # tidak ada classifier!
# ✅ Pakai class spesifik
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
5. Tokenizer & Model Beda Versi
# ❌ Tokenizer BERT, model RoBERTa → vocab mismatch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
# ✅ Selalu pakai nama yang sama untuk tokenizer & model
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
6. Cache Model Penuh-in Disk
# Default cache: ~/.cache/huggingface/hub (Windows: %USERPROFILE%\.cache\huggingface\hub)
# Bisa ber-GB cepat. Cek & cleanup:
huggingface-cli scan-cache
huggingface-cli delete-cache
7. Tidak Lock Versi Library
# ❌ requirements.txt terlalu loose
transformers
datasets
# ✅ Pin versi untuk reproducibility
transformers==4.46.0
datasets==3.1.0
torch==2.5.1
Hugging Face library berkembang cepat — API kadang break. Pin versi save your day.
8. Push to Hub Tanpa Login
# ❌ Tidak login dulu
trainer.push_to_hub("my-model") # error: 401 Unauthorized
# ✅
huggingface-cli login # masukkan token dari https://huggingface.co/settings/tokens
Bagian 10 — AutoModel Class Cheat Sheet
| Class | Use Case | Input | Output |
|---|---|---|---|
AutoModel |
Base, ekstraksi feature | tokens | hidden states |
AutoModelForSequenceClassification |
Sentiment, topic | tokens | logits per class |
AutoModelForTokenClassification |
NER, POS | tokens | logits per token |
AutoModelForQuestionAnswering |
QA extractive | question + context | start/end positions |
AutoModelForCausalLM |
GPT-style generation | prompt | next token logits |
AutoModelForMaskedLM |
Fill-mask (BERT-style) | text dengan [MASK] | logits per masked pos |
AutoModelForSeq2SeqLM |
Translation, summarization | source | target tokens |
Cek Pemahaman
- Bisa pakai pipeline untuk task NLP?
- Bisa load tokenizer dan model manual?
- Tahu kapan pakai model classification vs causal vs seq2seq?
- Bisa fine-tune pretrained model dengan Trainer?
- Bisa generate embedding dengan sentence-transformers?
- Tahu cara push model ke HF Hub?
Challenge 6.6
Challenge 1 — Sentiment Pipeline
Pakai pipeline untuk sentiment analysis:
- 5 review Indonesia
- 5 review Inggris
- Bandingkan akurasi 2 model berbeda
Challenge 2 — Fine-Tune untuk Indonesia
- Dataset IndoNLU SMSA (sentiment)
- Fine-tune
indobert-base-p1 - Evaluasi
- Push ke HF Hub
Challenge 3 — Embedding Search
- Generate embedding untuk 1000 dokumen
- Pakai cosine similarity untuk find similar
- Bikin query interface
Challenge 4 — Generate dengan LLM Open
Install Ollama. Jalankan llama3.2 lokal. Bikin chatbot CLI sederhana.
Selanjutnya: 07-llm-api.md