03 — LLMOps: Deploy, Monitor, & Cost Management
Estimated time: 10 hours
Prerequisites: 01-llm-evaluation.md, 02-guardrails-safety.md
Goal: After this module you can deploy an LLM app to production, monitor its performance, and manage costs so you stay within budget.
Why Does This Material Matter?
You can already build a RAG chatbot that is good and safe. But if it only runs on your laptop, who is going to use it? LLMOps is the skill that turns a "portfolio project" into a "product people actually use".
Recruiters don't just ask "can you build a chatbot?". They ask: "Have you deployed one? How do you handle 1,000 users? What is the cost per query?"
Part 1: Production LLM App Architecture
From Demo to Production
How to Read the Diagram:
- Purple, left = entry point: user → frontend.
- Cyan = backend API (FastAPI) acting as the orchestrator.
- Amber, top/bottom = sidecar services: Guardrails, Auth/RateLimit.
- Pink, center = LLM Router that picks the model.
- Pink, right = multiple LLM providers (OpenAI, Claude, Local).
- Cyan, bottom = data layer: Vector DB + Cache.
- Emerald = monitoring (LangSmith).
Walkthrough Step-by-Step:
- The user clicks a button in the frontend (Next.js/React).
- The request reaches the FastAPI backend.
- Auth/RateLimit checks the user's authorization and quota.
- Guardrails check the input (sanitize, detect injection).
- The API calls the LLM Router to pick a model (cheap for simple queries, expensive for complex ones).
- In parallel: the Vector DB is queried for RAG context.
- The cache is checked before any LLM call is made.
- The LLM (OpenAI/Claude/Local) answers.
- Output guardrails run, and monitoring logs everything.
- The response goes back to the user.
Everyday analogy: Think of a fully staffed fine-dining restaurant. It is not just a chef and a waiter: there is a manager (API), security (auth), food-safety QC (guardrails), a maitre d' (router), specialist chefs per cuisine (LLM providers), a pantry/freezer (cache + vector DB), and CCTV (monitoring). Production means many specialist roles, not a one-man show.
Static Mermaid diagram as a fallback:
flowchart LR
subgraph Demo["🧪 Demo (Phase 7)"]
A["Streamlit"] --> B["LangChain"]
B --> C["OpenAI API"]
B --> D["Chroma (local)"]
end
subgraph Prod["🚀 Production (Phase 7B)"]
E["Frontend<br/>(React/Next.js)"] --> F["Backend API<br/>(FastAPI)"]
F --> G["LLM Router"]
G --> H["OpenAI / Claude / Local"]
F --> I["Vector DB<br/>(managed)"]
F --> J["Cache Layer"]
F --> K["Monitoring"]
end
Production Stack Components
| Component | Demo | Production |
|---|---|---|
| Frontend | Streamlit | React/Next.js or a dedicated chat UI |
| Backend | LangChain directly | FastAPI + LangChain/LlamaIndex |
| LLM | 1 provider | Multiple providers + fallback |
| Vector DB | Chroma (in-memory) | Pinecone/Weaviate/Qdrant (managed) |
| Cache | None | Redis / semantic cache |
| Auth | None | API keys / OAuth |
| Monitoring | print() | LangSmith / LangFuse / custom |
| Deploy | localhost | Cloud (Railway, Render, AWS, GCP) |
Part 2: Deployment Options
Option 1: Platform-as-a-Service (Easiest)
Best for: MVPs, side projects, portfolio demos
| Platform | Strengths | Price |
|---|---|---|
| Railway | Deploy from GitHub, auto-scale | 🆓 tier / 💰 $5+/month |
| Render | Simple, free tier | 🆓 tier / 💰 $7+/month |
| Fly.io | Global edge, Docker-based | 🆓 tier / 💰 $5+/month |
| Hugging Face Spaces | Free for Streamlit/Gradio | 🆓 / 💰 for GPU |
| Vercel | Frontend + serverless functions | 🆓 tier / 💰 $20+/month |
Example deploy to Railway:
# Install Railway CLI
npm install -g @railway/cli
# Login
railway login
# Init project
railway init
# Deploy
railway up
Option 2: Container-based (More Control)
Best for: Small teams that need customization
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
# Build & run locally
docker build -t my-llm-app .
docker run -p 8000:8000 --env-file .env my-llm-app
# Tag the image for the registry, then push it
docker tag my-llm-app myregistry/my-llm-app:latest
docker push myregistry/my-llm-app:latest
Option 3: Cloud Provider (Enterprise)
Best for: Large scale, strict compliance
- AWS: ECS/EKS + Lambda + Bedrock
- GCP: Cloud Run + Vertex AI
- Azure: Container Apps + Azure OpenAI
For junior level: start with Option 1 (Railway/Render). Move up to Option 2 when you need to. Option 3 can wait until you work at a company that uses it.
Part 3: Backend API with FastAPI
Why FastAPI?
- Native async (important because LLM calls are slow)
- Auto-generated API docs (Swagger)
- Type-safe with Pydantic
- Popular in the ML/AI community
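To make the async point concrete, here is a minimal sketch of how slow calls overlap on one event loop. `fake_llm_call` is a hypothetical stand-in that only sleeps; no real provider is contacted, and the 0.2 s delay is an arbitrary assumption.

```python
import asyncio
import time


async def fake_llm_call(prompt: str, delay: float = 0.2) -> str:
    """Hypothetical stand-in for a slow LLM request (it only sleeps)."""
    await asyncio.sleep(delay)
    return f"answer to: {prompt}"


async def handle_many(prompts: list[str]) -> list[str]:
    # While one call waits on the "network", the event loop serves the others.
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))


def run_demo() -> tuple[list[str], float]:
    """Run five simulated calls concurrently and return (answers, seconds)."""
    start = time.perf_counter()
    answers = asyncio.run(handle_many(["q1", "q2", "q3", "q4", "q5"]))
    return answers, time.perf_counter() - start
```

Run one after another, the five calls would take about one second in total; run concurrently, the elapsed time should stay close to a single call's delay.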
Production Project Structure
my-llm-app/
├── app/
│ ├── main.py # FastAPI app
│ ├── routers/
│ │ ├── chat.py # Chat endpoints
│ │ └── health.py # Health check
│ ├── services/
│ │ ├── llm.py # LLM interaction
│ │ ├── retriever.py # RAG retrieval
│ │ └── guardrails.py # Safety checks
│ ├── models/
│ │ └── schemas.py # Request/Response models
│ └── config.py # Settings
├── tests/
├── Dockerfile
├── requirements.txt
└── .env
Example Implementation
# app/main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from app.routers import chat, health
app = FastAPI(title="My LLM App", version="1.0.0")
# "*" is convenient for a demo; in production, list your frontend's origin instead
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)
app.include_router(health.router)
app.include_router(chat.router, prefix="/api")
# app/routers/chat.py
from fastapi import APIRouter
from app.models.schemas import ChatRequest, ChatResponse
from app.services.llm import get_answer
from app.services.guardrails import check_input, check_output

router = APIRouter()


@router.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    # Input guardrail: refuse unsafe prompts before spending any tokens
    input_check = check_input(request.message)
    if not input_check.safe:
        return ChatResponse(
            answer="Sorry, I can't process that request.",
            sources=[],
        )
    # Get answer from RAG
    result = await get_answer(request.message)
    # Output guardrail: never return an answer that fails the safety check
    output_check = check_output(result.answer)
    if not output_check.safe:
        return ChatResponse(
            answer="Sorry, something went wrong. Please try a different question.",
            sources=[],
        )
    return ChatResponse(
        answer=result.answer,
        sources=result.sources,
    )

# app/models/schemas.py
from pydantic import BaseModel


class ChatRequest(BaseModel):
    message: str
    session_id: str | None = None


class ChatResponse(BaseModel):
    answer: str
    sources: list[str]
Part 4: Caching (Saving Cost & Latency)
Why Does Caching Matter?
- An LLM API call is slow (1-10 seconds) and expensive ($0.01-0.10 per query)
- Many users ask the same or similar questions
- For repeated queries, a cache can cut cost by roughly 50-80% and latency by about 90%
Types of Cache
How to Read the Diagram:
- Purple, left = user query.
- Cyan = 3 parallel paths: exact cache, semantic cache, miss path.
- Emerald = cache hits (fast, free).
- Pink = cache miss → LLM call (slow, expensive).
- Amber = store in the cache for next time.
- Emerald, right = final response.
Walkthrough Step-by-Step:
- A query arrives. Hash the normalized query (lowercase, trim) → check Redis.
- Exact hit: same hash → return the cached answer in ~0 ms, $0.
- Exact miss → semantic check: embed the query → look for a cached entry with similarity > 0.9.
- Semantic hit: a similar query exists → return its earlier answer, ~50 ms, $0.
- Full miss: nothing in the cache → call the LLM (2-5 s, $0.03 per query).
- After the LLM answers, store the result in the cache (key = hash, value = answer, TTL = 1 hour).
- Return the response to the user. The next similar query hits the cache.
Everyday analogy: Think of the FAQ at a customer service desk. An identical question (exact) is answered straight from the script. A similar question (semantic) is matched to a related FAQ entry. A new question means consulting a senior colleague (LLM call), and the answer is then written down for the FAQ next time.
Static Mermaid diagram as a fallback:
flowchart TD
Q["User Query"] --> C{"Cache Hit?"}
C -->|Exact match| R1["✅ Return cached<br/>(0ms, $0)"]
C -->|Semantic match| R2["✅ Return similar cached<br/>(50ms, $0)"]
C -->|Miss| L["🤖 Call LLM<br/>(2-5s, $0.03)"]
L --> S["💾 Store in cache"]
S --> R3["Return fresh answer"]
1. Exact Cache (Redis)
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)


def _cache_key(question: str) -> str:
    # MD5 is fine here: the hash is only a lookup key, not a security boundary
    digest = hashlib.md5(question.strip().lower().encode()).hexdigest()
    return f"llm:{digest}"


def get_cached_answer(question: str) -> str | None:
    cached = cache.get(_cache_key(question))
    if cached:
        return json.loads(cached)["answer"]
    return None


def cache_answer(question: str, answer: str, ttl: int = 3600):
    # setex stores the value with a TTL in seconds (default: 1 hour)
    cache.setex(_cache_key(question), ttl, json.dumps({"answer": answer}))
2. Semantic Cache (GPTCache)
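Different wording, same meaning → return the cached answer:
- "How much does BPJS class 1 cost?" ≈ "What's the BPJS premium for class one?"
from gptcache import cache
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation import SearchDistanceEvaluation

# Set up the semantic cache
onnx = Onnx()
# Assumed storage setup (SQLite for answers, FAISS for vectors), following the
# GPTCache README; semantic lookup needs a vector store behind it
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension),
)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
# Reads OPENAI_API_KEY from the environment
cache.set_openai_key()
# The hit threshold (for example similarity > 0.9) is configurable; check the
# GPTCache docs for the option in your installed version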
Part 5: Monitoring & Observability
What Should You Monitor?
| Metric | Why It Matters | Target |
|---|---|---|
| Latency | User experience | < 3 s at p95 |
| Error rate | Reliability | < 1% |
| Token usage | Cost | Track per user/endpoint |
| Hallucination rate | Quality | < 5% (from evals) |
| User satisfaction | Business value | Thumbs up > 80% |
| Cache hit rate | Efficiency | > 30% |
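Three of these metrics can be computed directly from per-query log records. The sketch below assumes a hypothetical `QueryLog` record with just the fields needed, and uses the nearest-rank definition of p95; monitoring tools may use a slightly different percentile definition.

```python
from dataclasses import dataclass


@dataclass
class QueryLog:
    latency_ms: float
    ok: bool          # False when the request ended in an error
    cache_hit: bool


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(logs: list[QueryLog]) -> dict[str, float]:
    if not logs:
        raise ValueError("need at least one log record")
    total = len(logs)
    return {
        "p95_latency_ms": p95([log.latency_ms for log in logs]),
        "error_rate": sum(not log.ok for log in logs) / total,
        "cache_hit_rate": sum(log.cache_hit for log in logs) / total,
    }
```

Comparing the returned values against the targets in the table (for example `p95_latency_ms < 3000`) gives a simple alerting rule.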
Monitoring Tools
flowchart LR
App["LLM App"] --> LS["LangSmith<br/>(tracing)"]
App --> LF["LangFuse<br/>(observability)"]
App --> P["Prometheus<br/>(metrics)"]
P --> G["Grafana<br/>(dashboard)"]
App --> S["Sentry<br/>(errors)"]
LangSmith Tracing (Recommended)
# .env
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_key
LANGCHAIN_PROJECT=my-rag-chatbot
# All LangChain calls are now traced automatically.
# Open smith.langchain.com to see:
# - Every step in the chain
# - Input/output per step
# - Latency per step
# - Token usage
# - Error traces
Custom Logging
import json
import logging
import time
from dataclasses import asdict, dataclass
from datetime import datetime

from app.models.schemas import ChatResponse
from app.services.llm import get_answer
# get_cached_answer and cache_answer are the helpers from the Exact Cache
# section; import them from wherever you keep that code.

logger = logging.getLogger(__name__)


@dataclass
class QueryMetrics:
    question: str
    answer: str
    latency_ms: float
    tokens_used: int
    cache_hit: bool
    sources_count: int
    timestamp: str


def log_metrics(metrics: QueryMetrics) -> None:
    # One JSON object per line is easy to parse later
    logger.info(json.dumps(asdict(metrics)))


async def tracked_query(question: str) -> ChatResponse:
    start = time.perf_counter()
    # Check cache
    cached = get_cached_answer(question)
    if cached is not None:
        latency = (time.perf_counter() - start) * 1000
        log_metrics(QueryMetrics(
            question=question, answer=cached,
            latency_ms=latency, tokens_used=0,
            cache_hit=True, sources_count=0,
            timestamp=datetime.now().isoformat(),
        ))
        return ChatResponse(answer=cached, sources=[])
    # Call LLM (result.tokens assumes get_answer reports token usage)
    result = await get_answer(question)
    latency = (time.perf_counter() - start) * 1000
    log_metrics(QueryMetrics(
        question=question, answer=result.answer,
        latency_ms=latency, tokens_used=result.tokens,
        cache_hit=False, sources_count=len(result.sources),
        timestamp=datetime.now().isoformat(),
    ))
    # Cache result
    cache_answer(question, result.answer)
    return ChatResponse(answer=result.answer, sources=result.sources)
Part 6: Cost Management
How Much Does an LLM Cost in Production?
| Model | Input (per 1M token) | Output (per 1M token) | Typical query cost |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | ~$0.01-0.05 |
| GPT-4o-mini | $0.15 | $0.60 | ~$0.001-0.005 |
| Claude Sonnet | $3.00 | $15.00 | ~$0.01-0.08 |
| Claude Haiku | $0.25 | $1.25 | ~$0.001-0.005 |
| Gemini Flash | $0.075 | $0.30 | ~$0.0005-0.002 |
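The per-token prices translate into a per-query cost once you know the token counts. A minimal sketch, using two rows of the table above as an assumed price list (provider prices change often, so treat the numbers as placeholders); `query_cost` is a hypothetical helper name.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one query, given its input and output token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

For a RAG query with 2,000 input tokens (prompt plus retrieved context) and 500 output tokens, `query_cost("gpt-4o", 2000, 500)` comes to about $0.01, and the same query on `gpt-4o-mini` to about $0.0006.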
Example calculation:
- 1,000 queries/day × $0.03/query = $30/day = $900/month
- With caching (50% hit rate): $450/month
- With model routing on top (simple → cheap, complex → expensive): about $200/month, depending on how much traffic the cheap model can absorb
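The arithmetic behind these three figures can be written as a small projection. This is a sketch under simplifying assumptions: cache hits are treated as free, a month is 30 days, and the function names and the $0.003 cheap-model cost used below are hypothetical.

```python
def monthly_cost(queries_per_day: int, cost_per_query: float,
                 cache_hit_rate: float = 0.0, days: int = 30) -> float:
    """Projected monthly LLM spend in USD; cache hits are treated as free."""
    paid_queries = queries_per_day * days * (1.0 - cache_hit_rate)
    return paid_queries * cost_per_query


def blended_cost_per_query(cheap_share: float, cheap_cost: float,
                           expensive_cost: float) -> float:
    """Average cost per query when a share of traffic goes to the cheap model."""
    return cheap_share * cheap_cost + (1.0 - cheap_share) * expensive_cost
```

`monthly_cost(1000, 0.03)` reproduces the $900 baseline and `monthly_cost(1000, 0.03, cache_hit_rate=0.5)` the $450 figure. Routing 60% of traffic to a model that costs an assumed $0.003 per query brings the projection to roughly $207, close to the $200 estimate.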
Cost-Saving Strategies
1. Model Routing
def route_to_model(question: str) -> str:
    """Route simple questions to a cheap model, complex ones to an expensive one."""
    # Heuristic: question length plus keywords that signal complexity
    # (adapt the keyword list to your users' language)
    complex_keywords = ["explain", "compare", "analyze"]
    is_short = len(question.split()) < 10
    has_complex_keyword = any(kw in question.lower() for kw in complex_keywords)
    if is_short and not has_complex_keyword:
        return "gpt-4o-mini"  # Simple question → cheap model
    return "gpt-4o"  # Complex → powerful model


# Or use an LLM classifier
ROUTER_PROMPT = """Classify this question complexity:
- SIMPLE: factual, short answer expected
- COMPLEX: needs reasoning, comparison, analysis
Question: {question}
Classification:"""
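If you go the classifier route, the model's free-text reply still has to be mapped to a model name. A minimal sketch, assuming the classifier answers with SIMPLE or COMPLEX as its first word; `model_from_classification` is a hypothetical helper, and anything unrecognized falls back to the stronger model so that quality does not silently drop.

```python
MODEL_BY_LABEL = {"SIMPLE": "gpt-4o-mini", "COMPLEX": "gpt-4o"}


def model_from_classification(raw_output: str, default: str = "gpt-4o") -> str:
    """Map the classifier's reply to a model name, tolerating case and punctuation."""
    words = raw_output.strip().upper().split()
    label = words[0].strip(".:,") if words else ""
    return MODEL_BY_LABEL.get(label, default)
```

The classifier call itself costs tokens, so it is worth running on the cheapest model and checking that the savings exceed that overhead.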
2. Prompt Optimization
# BAD: long prompt, many wasted tokens
bad_prompt = """
You are a very smart and helpful AI assistant. You always
answer politely and in detail. You must make sure your answers
are accurate and based on facts. If you don't know, say you don't know.
Here is the relevant context for answering the user's question:
{context}
Based on the context above, answer the following question completely:
{question}
"""
# GOOD: concise; confirm with your eval set that quality holds
good_prompt = """Context: {context}
Q: {question}
A (based on context only):"""
3. Token Budgeting
from fastapi import HTTPException

MAX_MONTHLY_BUDGET = 100  # USD
daily_budget = MAX_MONTHLY_BUDGET / 30
current_daily_spend = get_today_spend()  # your own spend tracker

# Check the hard limit first; otherwise the 80% branch would always win
if current_daily_spend > daily_budget:
    # Rate limit or queue
    raise HTTPException(429, "Daily budget exceeded. Try again tomorrow.")
elif current_daily_spend > daily_budget * 0.8:
    # Switch to cheaper model
    model = "gpt-4o-mini"
4. Chunking Strategy (RAG)
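from langchain_text_splitters import RecursiveCharacterTextSplitter

# Smaller chunks = fewer tokens in context = cheaper
# But too small = missing context
# Sweet spot: 300-500 tokens per chunk, overlap 50
# Note: plain RecursiveCharacterTextSplitter(chunk_size=...) counts characters;
# from_tiktoken_encoder counts tokens (requires the tiktoken package)
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,
    chunk_overlap=50,
)
# Retrieve fewer but more relevant chunks
# (vectorstore is your existing vector store object)
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 3}  # 3 chunks, not 10
)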
Part 7: Scaling & Reliability
Rate Limiting
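from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
# In app/main.py, slowapi also needs:
#   app.state.limiter = limiter
#   app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
# (RateLimitExceeded lives in slowapi.errors, the handler in slowapi)


@router.post("/chat")
@limiter.limit("10/minute")  # Max 10 queries per minute per IP
async def chat(request: Request, body: ChatRequest):
    ...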
Fallback Strategy
async def get_llm_response(prompt: str) -> str:
    # Providers are tried in order; call_llm and the error classes come from
    # your own provider wrapper
    providers = [
        ("openai", "gpt-4o"),
        ("anthropic", "claude-sonnet"),
        ("openai", "gpt-4o-mini"),  # Fallback to cheaper
    ]
    for provider, model in providers:
        try:
            return await call_llm(provider, model, prompt)
        except (RateLimitError, TimeoutError, APIError) as e:
            logger.warning(f"{provider}/{model} failed: {e}")
            continue
    return "Sorry, the service is currently unavailable. Please try again later."
Health Check
# app/routers/health.py
from fastapi import APIRouter

router = APIRouter()


@router.get("/health")
async def health():
    # Each check_* helper is your own function returning True or False
    checks = {
        "llm": await check_llm_connection(),
        "vectordb": await check_vectordb_connection(),
        "cache": check_cache_connection(),
    }
    all_healthy = all(checks.values())
    return {
        "status": "healthy" if all_healthy else "degraded",
        "checks": checks,
    }
Part 8: CI/CD for LLM Apps
Pipeline
flowchart LR
C["Code Push"] --> T["Unit Tests"]
T --> E["Eval Tests<br/>(RAGAS)"]
E --> B["Build Docker"]
B --> S["Deploy Staging"]
S --> M["Manual QA"]
M --> P["Deploy Production"]
P --> Mon["Monitor"]
GitHub Actions Example
# .github/workflows/deploy.yml
name: Deploy LLM App
on:
  push:
    branches: [main]
jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install deps
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit/
      - name: Run eval tests
        run: python eval_pipeline.py --threshold 0.8
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # The runner does not ship with the Railway CLI, so install it first
      - name: Install Railway CLI
        run: npm install -g @railway/cli
      - name: Deploy to Railway
        if: success()
        run: railway up --service my-llm-app
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}
Common Misconceptions
❌ "Deploy = upload to a server, done" → Deployment is only the start. Monitoring, iterating, and maintaining are the ongoing work.
❌ "LLM apps don't need unit tests" → They do. Test the guardrails, the routing logic, and the API contracts. Eval tests come on top of unit tests.
❌ "The most expensive model = the best result" → For about 80% of queries, a cheap model is enough. Smart routing can cut cost by around 70%.
❌ "Caching is useless because every question is unique" → A semantic cache catches similar questions. A hit rate of 30-50% is quite realistic.
❌ "Scaling = you need Kubernetes" → For < 10k users/day, a PaaS (Railway/Render) is enough. Don't over-engineer.
Check Your Understanding
- How does a demo architecture differ from a production LLM app?
- Name the 3 deployment options and when to use each.
- Why does caching matter? What is the difference between an exact and a semantic cache?
- Name 5 metrics you should monitor in production.
- How can model routing save cost?
- What is a fallback strategy and why does it matter?
Challenge 7B.3
Challenge 1: Deploy to the Cloud (Required)
Deploy the RAG chatbot from Phase 7 to Railway or Render. Make sure:
- It is reachable via a public URL
- The health check endpoint works
- Environment variables are safe (no hardcoded API keys)
Challenge 2: Add Caching (Medium)
Implement an exact cache with Redis (or an in-memory dict for a demo). Measure:
- Latency with vs. without the cache
- The percentage of queries that hit the cache after 50 queries
Challenge 3: Monitoring Dashboard (Medium)
Add logging to every query. Build a simple dashboard (Streamlit works) that shows:
- Total queries today
- Average latency
- Cache hit rate
- Top 5 most frequent questions
Challenge 4: Cost Optimization (Hard)
Implement model routing: simple questions → GPT-4o-mini, complex → GPT-4o. Compare:
- Total cost per 100 queries (before vs. after routing)
- Quality score (from evals): does it drop significantly?
Challenge 5: Full Production Stack (Very Hard)
Combine everything: FastAPI backend + guardrails + caching + monitoring + deploy. This can become your main portfolio project.
Closing Quote
"Everyone wants to build AI. Few want to operate AI."
Operational skills (deploy, monitor, cost) are what separate engineers who can ship from those who can only prototype. You now have both.
Congratulations! You have finished all of the preparation material. You are ready for the Dicoding bootcamp with a foundation far stronger than the average participant's.