Hugging Face Ecosystem

06 — Hugging Face Ecosystem

Estimated time: 6 hours
Goal: Use pretrained models on real tasks. Hugging Face is "the GitHub of AI", so this is a must-have skill.


Why Does This Material Matter?

Analogy: Hugging Face is the GitHub of AI models. On GitHub you download someone's code repo with git clone; on Hugging Face you download an AI model that cost millions of dollars to train with a single from_pretrained() call. BERT, GPT-2, Llama, Whisper, Stable Diffusion? Each is one line away. Skipping this means skipping the modern AI era.

Before Hugging Face's library arrived (2018), using a pretrained model was a nightmare: every research group published its own format, the code was messy, and replication was hard. Hugging Face unified all of this behind one consistent API: AutoTokenizer.from_pretrained(name) plus AutoModel.from_pretrained(name) works for thousands of models. This is a library you will use every day as a GenAI engineer.

Three key skills you will pick up: (1) pipeline() for one-line solutions to many NLP tasks, (2) AutoTokenizer + AutoModel for full control, and (3) Trainer + datasets for fine-tuning a model to your own task (including Indonesian).

Mental Map: Hugging Face Workflow

How to Read the Diagram:

  • Top = Hub + loading, left = input text, right = final result
  • from_pretrained() downloads the Tokenizer and Model from the Hub
  • The Tokenizer turns text → IDs, the Model turns IDs → output
  • The same pattern holds for thousands of models: BERT, GPT-2, Llama, etc.

Step-by-Step Walkthrough:

  1. The HF Hub hosts thousands of ready-to-use models
  2. AutoTokenizer.from_pretrained(name) + AutoModel.from_pretrained(name) download the files into a local cache
  3. The tokenizer turns the text "I love AI" into token IDs [101, 1045, 2293, ...]
  4. The model takes the IDs and produces logits or generated text
  5. The result is decoded into user-friendly output
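To make the text → IDs → text round trip concrete, here is a toy sketch in plain Python. It is only an illustration: the vocabulary, the IDs, and the whitespace splitting are made up for this example, whereas a real Hugging Face tokenizer uses a learned subword vocabulary.

```python
# Toy illustration only: a real tokenizer uses a learned subword vocabulary.
# The vocabulary and IDs below are invented for this example.
TOY_VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "i": 3, "love": 4, "ai": 5}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}


def encode(text: str) -> list[int]:
    """Lowercase, split on whitespace, and map each word to an ID."""
    words = text.lower().split()
    ids = [TOY_VOCAB.get(w, TOY_VOCAB["[UNK]"]) for w in words]
    # Wrap the sequence in special tokens, as BERT-style tokenizers do.
    return [TOY_VOCAB["[CLS]"], *ids, TOY_VOCAB["[SEP]"]]


def decode(ids: list[int], skip_special_tokens: bool = True) -> str:
    """Map IDs back to tokens and join them into a string."""
    special = {"[CLS]", "[SEP]"}
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in special]
    return " ".join(tokens)


ids = encode("I love AI")
print(ids)          # [0, 3, 4, 5, 1]
print(decode(ids))  # i love ai
```

Words missing from the toy vocabulary map to [UNK]; real subword tokenizers mostly avoid that by splitting unknown words into smaller known pieces.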

Everyday Analogy: Hugging Face is the GitHub of AI. Want a BERT model? from_pretrained() is your git clone. You now get a model that cost millions of dollars to train with one line.

Static Mermaid diagram as a fallback:

flowchart LR
    Hub["🤗 HF Hub<br/>(thousands of models)"] --> DL["⬇️ from_pretrained()"]
    DL --> Tok["✂️ Tokenizer"]
    DL --> Mdl["🧠 Model"]
    Text["📝 Input text"] --> Tok
    Tok --> IDs["🔢 Token IDs"]
    IDs --> Mdl
    Mdl --> Out["🎯 Logits/Generation"]
    Out --> Result["✅ Final result"]
    style Hub fill:#fef3c7
    style Tok fill:#dbeafe
    style Mdl fill:#fed7aa
    style Result fill:#d1fae5

Part 1: Setup

pip install transformers datasets accelerate sentencepiece torch

from transformers import AutoTokenizer, AutoModel, pipeline

Create an account at huggingface.co (free). Set up an access token if you need access to gated models.


Part 2: Pipeline (Easiest)

Pipeline is the highest-level abstraction: a one-line solution.

from transformers import pipeline

# Sentiment analysis
sentiment = pipeline("sentiment-analysis")
print(sentiment("I love this product!"))
# [{'label': 'POSITIVE', 'score': 0.999}]

# Text classification (custom model)
classifier = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
print(classifier("This is bad"))

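# NER (aggregation_strategy merges word pieces into whole entities; it replaces the deprecated grouped_entities flag)
ner = pipeline("ner", aggregation_strategy="simple")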
print(ner("Steve Jobs founded Apple in California"))

# Summarization
summ = pipeline("summarization")
print(summ("Long article text here ...", max_length=50))

# Translation
trans = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
print(trans("Saya suka belajar"))

# Text generation
gen = pipeline("text-generation", model="gpt2")
print(gen("Once upon a time", max_length=50))

# Question answering
qa = pipeline("question-answering")
print(qa(question="Where is Apple?", context="Apple is in California"))

# Zero-shot classification
zsc = pipeline("zero-shot-classification")
print(zsc("Saya senang sekali", candidate_labels=["positive", "negative", "neutral"]))

A pipeline gets you a working result in one line, but it gives you less control.
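Pipeline results are plain Python lists of dicts, so post-processing needs no extra library. The sketch below flags low-confidence predictions for manual review. The function name, the 0.8 threshold, and the sample results are assumptions made for illustration; the results list is written by hand in the same shape as the sentiment output shown above, not produced by a model.

```python
def flag_low_confidence(texts, results, threshold=0.8):
    """Pair each text with its prediction and flag scores below the threshold."""
    rows = []
    for text, res in zip(texts, results):
        rows.append({
            "text": text,
            "label": res["label"],
            "score": res["score"],
            "needs_review": res["score"] < threshold,
        })
    return rows


# Hand-written stand-in for the output of sentiment(texts).
texts = ["I love this product!", "It is okay, I guess"]
results = [
    {"label": "POSITIVE", "score": 0.999},
    {"label": "NEGATIVE", "score": 0.61},
]

rows = flag_low_confidence(texts, results)
for row in rows:
    print(row["label"], row["score"], "review" if row["needs_review"] else "ok")
```

In a real run you would pass the list of texts to the pipeline object and feed its return value into the helper.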


Part 3: Tokenizer + Model Manually

More control:

Analogy, Manual vs Pipeline: pipeline() is an automatic coffee machine: press a button and the coffee is ready. A manual tokenizer + model is a professional espresso machine where you set the grind size, pressure, and temperature yourself. For serious experiments, fine-tuning, or custom outputs, you need manual control.

Visualizing the Inference Pipeline (Manual)

flowchart LR
    T["📝 'I love AI'"] --> Tok["✂️ Tokenizer"]
    Tok --> IDs["🔢 input_ids<br/>[101, 1045, 2293, ...]"]
    Tok --> Mask["🎭 attention_mask<br/>[1, 1, 1, ...]"]
    IDs --> M["🧠 Model"]
    Mask --> M
    M --> L["📊 Logits<br/>(batch, classes)"]
    L --> SM["softmax"]
    SM --> P["📈 Probs"]
    P --> Lab["🏷️ id2label"]
    Lab --> R["✅ 'POSITIVE'"]
    style T fill:#dbeafe
    style M fill:#fed7aa
    style R fill:#d1fae5

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize
text = "I love AI"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print(inputs)
# {'input_ids': tensor([...]), 'attention_mask': tensor([...])}

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = torch.softmax(logits, dim=-1)
    pred = probs.argmax(dim=-1)
    
print(probs)
print(model.config.id2label[pred.item()])    # "POSITIVE"
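The last three steps (softmax, argmax, id2label) are simple arithmetic. Here is a worked sketch in plain Python; the logits are made-up numbers, and the id2label dict is written by hand to mirror a two-class sentiment model rather than read from a real config.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # hand-written for this example
logits = [-2.0, 3.0]                       # made-up model output

probs = softmax(logits)
pred = max(range(len(probs)), key=lambda i: probs[i])  # argmax

print([round(p, 4) for p in probs])  # [0.0067, 0.9933]
print(id2label[pred])                # POSITIVE
```

A logit gap of 5 already pushes the winning class above 99%, which is why classifier scores so often look extremely confident.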

Part 4: Model Categories

AutoClass (Auto-detect Model Type)

  • AutoTokenizer — any tokenizer
  • AutoModel — base model (no head)
  • AutoModelForSequenceClassification — classification
  • AutoModelForTokenClassification — NER
  • AutoModelForQuestionAnswering — QA
  • AutoModelForCausalLM — text generation (GPT-style)
  • AutoModelForMaskedLM — fill-mask (BERT-style)
  • AutoModelForSeq2SeqLM — translation, summarization

| Model | Size | Best for |
|---|---|---|
| bert-base-uncased | 110M | English, classification |
| distilbert-base-uncased | 67M | Faster BERT |
| roberta-base | 125M | Better BERT |
| xlm-roberta-base | ~270M | Multilingual (incl. Indonesian) |
| bert-base-multilingual-cased | ~178M | Multilingual |
| indobenchmark/indobert-base-p1 | - | Indonesian-specific |
| gpt2 | 124M | Generation, English |
| t5-small | 60M | Seq2seq |
| google/flan-t5-base | 250M | Better instruction-following T5 |
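The parameter counts above translate directly into a rough memory footprint: parameters times bytes per parameter. The sketch below covers the weights only. Activations, optimizer state, and framework overhead come on top, so treat the numbers as a lower bound. The helper name is hypothetical.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_params: int, dtype: str = "float32") -> float:
    """Approximate size of the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for name, params in [("bert-base-uncased", 110_000_000), ("gpt2", 124_000_000)]:
    print(name, weight_memory_mb(params), "MB in float32")
# bert-base-uncased 440.0 MB in float32
# gpt2 496.0 MB in float32

print(weight_memory_mb(110_000_000, "float16"))  # 220.0
```

Halving the bytes per parameter halves the weight footprint, which is the basic idea behind loading models in float16 or quantizing them.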

Part 5: Datasets Library

from datasets import load_dataset

# Built-in
ds = load_dataset("imdb")
print(ds)
# DatasetDict({
#     train: Dataset (25000 rows)
#     test: Dataset (25000 rows)
# })

print(ds["train"][0])
# {'text': '...', 'label': 1}

# Filter
positive = ds["train"].filter(lambda x: x["label"] == 1)

# Map (transform)
def add_length(example):
    example["length"] = len(example["text"])
    return example

ds = ds.map(add_length)
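When map is called with batched=True (as the fine-tuning code later does), the function receives a dict that maps each column name to a list of values, not a single example. Here is a small sketch of the batched version of add_length. The batch is built by hand so the snippet runs without downloading anything; the function name and sample texts are made up.

```python
def add_length_batched(batch: dict) -> dict:
    """Batched version: `batch` maps each column name to a list of values."""
    batch["length"] = [len(text) for text in batch["text"]]
    return batch


# Hand-built batch in the dict-of-lists shape that
# ds.map(add_length_batched, batched=True) would pass in.
fake_batch = {"text": ["great movie", "bad"], "label": [1, 0]}

out = add_length_batched(fake_batch)
print(out["length"])  # [11, 3]
```

Batched functions are usually faster because the per-call overhead is paid once per batch instead of once per row.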

Indonesian Datasets

# Dataset ids and config names change over time; confirm each one on the Hub before use.
ds = load_dataset("indonlp/indonlu", "smsa")     # sentiment
ds = load_dataset("id_clickbait")                 # clickbait
ds = load_dataset("indonesian_news")              # news

Part 6: Fine-Tuning a Pretrained Model

Fine-Tuning Analogy: Think of a pretrained model as a general bachelor's graduate who has already read millions of books. Fine-tuning is a three-month specialization course for one specific task (Indonesian sentiment, medical classification, and so on). You do not start over from kindergarten: the general knowledge is already there and only needs polishing for the new domain. The result is high accuracy with a far smaller dataset than training from scratch would need.

Visualizing the Fine-Tuning Workflow

How to Read the Diagram:

  • Top left = pretrained model (general knowledge), bottom left = new dataset
  • Middle = Trainer, which combines the two (a few epochs are enough)
  • Right = fine-tuned model, ready for the specific task
  • Save / push_to_hub = share to the HF Hub or store locally

Step-by-Step Walkthrough:

  1. The pretrained model has already been trained on millions of documents (BERT, IndoBERT, etc.)
  2. from_pretrained() loads the model and its weights
  3. The custom dataset (SMSA, IMDB, etc.) is tokenized
  4. Trainer fine-tunes for 2-5 epochs (far faster than training from scratch)
  5. The result is a fine-tuned model specialized for your task
  6. save_pretrained or push_to_hub persists it
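To get a feel for how short "a few epochs" is, you can estimate the number of optimizer steps before training. The sketch assumes a single device and no gradient accumulation, and the dataset size of 11,000 examples is a hypothetical number chosen for the arithmetic, not a measured value.

```python
import math


def total_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run on one device, without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Hypothetical run: 11,000 training examples, batch size 16, 3 epochs.
steps = total_training_steps(num_examples=11_000, batch_size=16, epochs=3)
warmup_steps = 500

print(steps)                           # 2064
print(round(warmup_steps / steps, 2))  # 0.24
```

With these numbers a warmup of 500 steps covers about a quarter of the run, so on a small dataset it is worth checking that the warmup does not eat most of the training.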

Everyday Analogy: A pretrained model is a general bachelor's graduate who has already read millions of books. Fine-tuning is a three-month specialization course for a specific task. There is no need to start over from kindergarten: the general knowledge is already there and only needs polishing.

Static Mermaid diagram as a fallback:

flowchart LR
    PT["🎓 Pretrained<br/>BERT/GPT<br/>(general knowledge)"] --> Load["⬇️ from_pretrained()"]
    Data["📚 Custom dataset<br/>(SMSA, IMDB, ...)"] --> Tok["✂️ Tokenize"]
    Tok --> Train["🏋️ Trainer<br/>(few epochs)"]
    Load --> Train
    Train --> FT["🎯 Fine-tuned model<br/>(specialized)"]
    FT --> Save["💾 save / push_to_hub"]
    style PT fill:#fef3c7
    style Train fill:#dbeafe
    style FT fill:#d1fae5

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Load
# The Hub repo id includes the organization prefix
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModelForSequenceClassification.from_pretrained("indobenchmark/indobert-base-p1", num_labels=3)

# Dataset
ds = load_dataset("indonlp/indonlu", "smsa")

def tokenize_fn(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

ds_tokenized = ds.map(tokenize_fn, batched=True)

# Training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds_tokenized["train"],
    eval_dataset=ds_tokenized["validation"],
)

trainer.train()
trainer.evaluate()

# Save
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")
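The Trainer above is created without a compute_metrics argument, so evaluate() reports the loss but no accuracy. Here is a minimal sketch of such a function. Trainer passes an object that unpacks into (logits, labels); the version below is written in plain Python so it runs on ordinary lists, and the sample logits and labels are made up.

```python
def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair, the shape Trainer hands over."""
    logits, labels = eval_pred
    # argmax over each row of logits gives the predicted class id
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(preds)}


# Made-up logits for 3 examples and 3 classes.
fake_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 3.0]]
fake_labels = [1, 0, 1]

metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)  # {'accuracy': 0.6666666666666666}
```

To use it, pass compute_metrics=compute_metrics when constructing the Trainer. In a real run the logits arrive as NumPy arrays, where np.argmax(logits, axis=-1) is the more idiomatic choice.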

Push to Hub

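# Trainer.push_to_hub() takes a commit message, not a repo id, so push the model and tokenizer directly
model.push_to_hub("yazid/my-sentiment-model")
tokenizer.push_to_hub("yazid/my-sentiment-model")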

# Reload anywhere
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("yazid/my-sentiment-model")

Part 7: Embeddings & Sentence Transformers

pip install sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # 384-dim
# Or for Indonesian text: "indobenchmark/indobert-base-p1" (a plain BERT checkpoint, so embeddings are 768-dim, not 384)

texts = ["Saya suka kucing", "Anjing itu lucu", "Cuaca hari ini cerah"]
embeddings = model.encode(texts)
print(embeddings.shape)    # (3, 384)

Similarity

from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(embeddings)
print(sim)
# 3x3 matrix of similarities between the texts

Make sure you understand this: it is the foundation of RAG in Phase 7.
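The link to RAG is retrieval: embed a query, score it against every document embedding, and keep the best matches. Below is a dependency-free sketch of that step. The 3-dimensional vectors are invented so the arithmetic is easy to follow; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_index, score) for the k most similar documents."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(i, round(s, 3)) for s, i in scored[:k]]


# Made-up 3-dim "embeddings" standing in for model.encode(texts).
docs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.0, 0.0]

hits = top_k(query, docs)
print(hits)  # [(0, 1.0), (1, 0.8)]
```

Document 2 is orthogonal to the query (similarity 0), so it never makes the top 2. In a RAG system the retrieved documents would then be placed into the prompt of a language model.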


Part 8: Generating with an Open-Source LLM

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load a small model
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16)

# Generate
prompt = "What is machine learning?"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
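The temperature and top_p arguments reshape the next-token distribution before a token is sampled. Here is a conceptual sketch in plain Python with made-up logits for a 4-token vocabulary. It illustrates the idea and is not the library's actual implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for 4 candidate tokens

sharp = softmax_with_temperature(logits, 0.7)  # low temperature: more peaked
flat = softmax_with_temperature(logits, 1.5)   # high temperature: flatter
nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9)

print(round(sharp[0], 3), round(flat[0], 3))
print(sorted(nucleus))  # [0, 1, 2]
```

Lower temperature concentrates probability on the top token. With top_p=0.9 the least likely token is dropped here because the first three already cover more than 90% of the mass.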

For Larger Models: Use Ollama

# Install Ollama locally (download it from ollama.com)

ollama run llama3.2          # 3B params, runs on an ordinary laptop
ollama run mistral            # 7B

import requests

response = requests.post("http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Hello", "stream": False})
print(response.json()["response"])
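The request above sets "stream": False and waits for the whole answer. With streaming enabled, the server sends newline-delimited JSON chunks, each carrying a piece of text in "response" and a "done" flag. The sketch below shows how those chunks could be joined. The helper names are hypothetical and the sample lines are written by hand, not captured from a real server.

```python
import json


def build_payload(model: str, prompt: str, stream: bool = True) -> dict:
    """Request body for the /api/generate endpoint used above."""
    return {"model": model, "prompt": prompt, "stream": stream}


def join_stream(lines) -> str:
    """Concatenate the 'response' pieces from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Hand-written sample chunks in the streaming format.
sample = [
    '{"model": "llama3.2", "response": "Hel", "done": false}',
    '{"model": "llama3.2", "response": "lo!", "done": false}',
    '{"model": "llama3.2", "response": "", "done": true}',
]

print(build_payload("llama3.2", "Hello"))
print(join_stream(sample))  # Hello!
```

Against a running server, the lines would come from requests.post(url, json=payload, stream=True).iter_lines(); printing each piece as it arrives is what makes a CLI chatbot feel responsive.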

Part 9: Best Practices

When Using Pretrained Models

  1. Check the license: some models are restricted (Llama, etc.)
  2. Read the model card to understand capabilities and limitations
  3. Test on a small sample first
  4. Use a model size that is sufficient: no need for Llama-70B on a simple task

When Fine-Tuning

  1. Start with a small model: DistilBERT, T5-small
  2. Freeze some layers when data is scarce
  3. Track loss with TensorBoard / W&B
  4. Save checkpoints often

Production

  1. Quantization (int8, int4) for speed-ups
  2. ONNX/TorchScript export for deployment
  3. Batch inference in production
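Batch inference simply means sending several inputs through the model per call instead of one at a time. Here is a minimal sketch of the chunking step; the helper name and the batch size of 4 are arbitrary choices for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"review {i}" for i in range(10)]
batches = list(batched(texts, batch_size=4))

print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch can then be tokenized with padding=True and passed to the model in one forward pass. Pipelines also accept a list of inputs together with a batch_size argument, which handles this chunking for you.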

Part 9 (continued): Common Mistakes & FAQ

1. Forgetting model.eval() and torch.no_grad() at Inference

# ❌ Model left in train mode: Dropout is active, so results can differ on every call
model.train()
output = model(**inputs)

# ✅
model.eval()
with torch.no_grad():
    output = model(**inputs)

from_pretrained() already returns the model in eval mode, but calling model.train() or running a training loop switches it to train mode. Before inference, call model.eval() and wrap the forward pass in torch.no_grad() so no gradients are tracked.

2. Forgetting to Move the Model & Inputs to the GPU

# ❌ Model on GPU, inputs on CPU → device mismatch
model = model.to("cuda")
output = model(**inputs)        # error

# ✅
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

3. Truncation Not Enabled → Sequence Too Long

# ❌ Long text > max position embeddings (512 for BERT) → error
inputs = tokenizer(very_long_text, return_tensors="pt")

# ✅
inputs = tokenizer(very_long_text, truncation=True, max_length=512,
                   padding=True, return_tensors="pt")
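What truncation, max_length, and padding do to a batch can be shown with a toy function on lists of IDs. This is only a sketch of the idea with made-up IDs. A real tokenizer, for instance, keeps the closing special token when it truncates, which this toy version does not.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up IDs: one sequence of 6 tokens, one of 3 tokens.
batch = pad_and_truncate([[101, 7, 8, 9, 10, 102], [101, 7, 102]], max_length=4)

print(batch["input_ids"])       # [[101, 7, 8, 9], [101, 7, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Truncation silently drops everything past max_length, so for long documents it is worth checking how much text is being cut.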

4. Using the Wrong AutoModel Class

# ❌ AutoModel = base model without a classification head
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")    # no classifier!

# ✅ Use the task-specific class
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

5. Tokenizer & Model From Different Checkpoints

# ❌ Tokenizer BERT, model RoBERTa → vocab mismatch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")

# ✅ Always use the same name for the tokenizer and the model
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

6. Model Cache Filling Up the Disk

# Default cache: ~/.cache/huggingface/hub (Windows: %USERPROFILE%\.cache\huggingface\hub)
# It can grow to many GB quickly. Inspect and clean up:

huggingface-cli scan-cache
huggingface-cli delete-cache

7. Not Pinning Library Versions

# ❌ requirements.txt is too loose
transformers
datasets

# ✅ Pin versions for reproducibility
transformers==4.46.0
datasets==3.1.0
torch==2.5.1

Hugging Face libraries evolve quickly, and APIs sometimes break between releases. Pinning versions will save your day.
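A small check like the one below can catch loose requirements before they cause trouble. It is a rough sketch: the function name is hypothetical and the regular expression only recognizes simple name==version lines, not every form a requirements file allows.

```python
import re

# Matches lines such as "transformers==4.46.0" or "package[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==\S+$")


def unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' version pin."""
    bad = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            bad.append(line)
    return bad


sample = """\
transformers==4.46.0
datasets
torch>=2.0
"""

print(unpinned(sample))  # ['datasets', 'torch>=2.0']
```

Run against your own requirements.txt, an empty list means every dependency is pinned to an exact version.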

8. Pushing to the Hub Without Logging In

# ❌ Not logged in first
model.push_to_hub("my-model")    # error: 401 Unauthorized

# ✅ Log in once, then push
huggingface-cli login    # paste a token from https://huggingface.co/settings/tokens

Part 10: AutoModel Class Cheat Sheet

| Class | Use Case | Input | Output |
|---|---|---|---|
| AutoModel | Base model, feature extraction | tokens | hidden states |
| AutoModelForSequenceClassification | Sentiment, topic | tokens | logits per class |
| AutoModelForTokenClassification | NER, POS | tokens | logits per token |
| AutoModelForQuestionAnswering | Extractive QA | question + context | start/end positions |
| AutoModelForCausalLM | GPT-style generation | prompt | next-token logits |
| AutoModelForMaskedLM | Fill-mask (BERT-style) | text with [MASK] | logits per masked position |
| AutoModelForSeq2SeqLM | Translation, summarization | source | target tokens |
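The cheat sheet can also be written down as a lookup table, which is handy when a script has to choose the class from a task name. The sketch below stores class names as strings so it runs without transformers installed. The task keys follow the pipeline task names used earlier, and the helper name is hypothetical.

```python
TASK_TO_CLASS = {
    "feature-extraction": "AutoModel",
    "text-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
    "question-answering": "AutoModelForQuestionAnswering",
    "text-generation": "AutoModelForCausalLM",
    "fill-mask": "AutoModelForMaskedLM",
    "translation": "AutoModelForSeq2SeqLM",
    "summarization": "AutoModelForSeq2SeqLM",
}


def pick_class(task: str) -> str:
    """Return the name of the AutoModel class that fits a task."""
    try:
        return TASK_TO_CLASS[task]
    except KeyError:
        raise ValueError(f"No class listed for task {task!r}") from None


print(pick_class("fill-mask"))      # AutoModelForMaskedLM
print(pick_class("summarization"))  # AutoModelForSeq2SeqLM
```

With transformers installed, the returned name can be resolved to the actual class with getattr(transformers, pick_class(task)) before calling from_pretrained().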

Check Your Understanding

  • Can you use pipeline for an NLP task?
  • Can you load a tokenizer and model manually?
  • Do you know when to use a classification, causal, or seq2seq model?
  • Can you fine-tune a pretrained model with Trainer?
  • Can you generate embeddings with sentence-transformers?
  • Do you know how to push a model to the HF Hub?

Challenge 6.6

Challenge 1: Sentiment Pipeline

Use pipeline for sentiment analysis:

  • 5 Indonesian reviews
  • 5 English reviews
  • Compare the accuracy of 2 different models

Challenge 2: Fine-Tune for Indonesian

  1. IndoNLU SMSA dataset (sentiment)
  2. Fine-tune indobenchmark/indobert-base-p1
  3. Evaluate
  4. Push to the HF Hub

Challenge 3: Embedding Similarity Search

  1. Generate embeddings for 1000 documents
  2. Use cosine similarity to find similar documents
  3. Build a query interface

Challenge 4: Generate with an Open LLM

Install Ollama. Run llama3.2 locally. Build a simple CLI chatbot.


Next: 07-llm-api.md