LLMOps: Deploy, Monitor, & Cost Management

Objective

After this module you will be able to deploy an LLM app to production, monitor its performance, and manage cost so it stays within budget.

03 — LLMOps: Deploy, Monitor, & Cost Management

Estimated time: 10 hours
Prerequisites: 01-llm-evaluation.md, 02-guardrails-safety.md


Why Does This Material Matter?

You can already build a RAG chatbot that is good and safe. But if it only runs on your laptop, who is going to use it? LLMOps is the skill that turns a "portfolio project" into a "product people actually use".

Recruiters do not just ask "can you build a chatbot?". They ask: "Have you deployed one? How do you handle 1,000 users? What is the cost per query?"


Part 1: Production LLM App Architecture

From Demo to Production

How to Read the Diagram:

  • Purple, left = entry point: user → frontend.
  • Cyan = backend API (FastAPI) as the orchestrator.
  • Amber, top/bottom = sidecar services: Guardrails, Auth/RateLimit.
  • Pink, center = LLM Router that picks the model.
  • Pink, right = multiple LLM providers (OpenAI, Claude, Local).
  • Cyan, bottom = data layer: Vector DB + Cache.
  • Emerald = monitoring (LangSmith).

Step-by-Step Walkthrough:

  1. The user clicks a button in the frontend (Next.js/React).
  2. The request reaches the FastAPI backend.
  3. Auth/RateLimit checks the user's authorization & quota.
  4. Guardrails check the input (sanitize, detect injection).
  5. The API calls the LLM Router to pick a model (cheap for simple queries, expensive for complex ones).
  6. In parallel: a Vector DB query fetches the RAG context.
  7. The cache is checked before any LLM call.
  8. The LLM (OpenAI/Claude/Local) answers.
  9. Output guardrails run, and monitoring logs everything.
  10. The response goes back to the user.

Everyday Analogy: Think of a fully staffed fine-dining restaurant. It is not just a chef and a waiter: there is a manager (API), security (auth), food-safety QC (guardrails), a maitre d' (router), a specialist chef per cuisine (LLM providers), a pantry/freezer (cache + vector DB), and CCTV (monitoring). Production means many specialist roles, not a one-man show.
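
The walkthrough above can be sketched as a single request handler. This is a minimal illustration under stated assumptions, not the module's implementation: every helper (`is_authorized`, `is_safe`, `pick_model`, `retrieve_context`, `call_llm`) is a hypothetical stub, the cache is a plain dict, and the cache check is moved ahead of routing and retrieval so that a hit skips that work.

```python
# All helpers below are hypothetical stubs standing in for real services.
CACHE: dict[str, str] = {}
LOG: list[dict] = []

def is_authorized(user_id: str) -> bool:
    return user_id != ""                               # stub: auth + quota check

def is_safe(text: str) -> bool:
    return "ignore previous" not in text.lower()       # stub: guardrail check

def pick_model(question: str) -> str:
    return "cheap-model" if len(question.split()) < 10 else "strong-model"

def retrieve_context(question: str) -> list[str]:
    return ["doc-1"]                                   # stub: vector DB query

def call_llm(model: str, question: str, context: list[str]) -> str:
    return f"[{model}] answer to: {question}"          # stub: provider call

def handle_request(user_id: str, question: str) -> str:
    if not is_authorized(user_id):                     # step 3
        return "unauthorized"
    if not is_safe(question):                          # step 4
        return "blocked"
    key = question.strip().lower()
    if key in CACHE:                                   # step 7 (checked early)
        LOG.append({"q": question, "cache_hit": True})
        return CACHE[key]
    model = pick_model(question)                       # step 5
    context = retrieve_context(question)               # step 6
    answer = call_llm(model, question, context)        # step 8
    if not is_safe(answer):                            # step 9: output guardrail
        return "blocked"
    CACHE[key] = answer
    LOG.append({"q": question, "cache_hit": False, "model": model})
    return answer                                      # step 10
```

Calling `handle_request("u1", "What is BPJS?")` twice goes through the stubbed LLM once; the second call is served from `CACHE`.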

Static Mermaid diagram as a fallback:

flowchart LR
    subgraph Demo["🧪 Demo (Phase 7)"]
        A["Streamlit"] --> B["LangChain"]
        B --> C["OpenAI API"]
        B --> D["Chroma (local)"]
    end

    subgraph Prod["🚀 Production (Phase 7B)"]
        E["Frontend<br/>(React/Next.js)"] --> F["Backend API<br/>(FastAPI)"]
        F --> G["LLM Router"]
        G --> H["OpenAI / Claude / Local"]
        F --> I["Vector DB<br/>(managed)"]
        F --> J["Cache Layer"]
        F --> K["Monitoring"]
    end

Production Stack Components

Component | Demo | Production
Frontend | Streamlit | React/Next.js or a dedicated chat UI
Backend | LangChain directly | FastAPI + LangChain/LlamaIndex
LLM | 1 provider | Multiple providers + fallback
Vector DB | Chroma (in-memory) | Pinecone/Weaviate/Qdrant (managed)
Cache | None | Redis / semantic cache
Auth | None | API keys / OAuth
Monitoring | print() | LangSmith / Langfuse / custom
Deploy | localhost | Cloud (Railway, Render, AWS, GCP)

Part 2: Deployment Options

Option 1: Platform-as-a-Service (Easiest)

Best for: MVPs, side projects, portfolio demos

Platform | Strengths | Price
Railway | Deploys from GitHub, auto-scale | 🆓 tier / 💰 $5+/month
Render | Simple, free tier | 🆓 tier / 💰 $7+/month
Fly.io | Global edge, Docker-based | 🆓 tier / 💰 $5+/month
Hugging Face Spaces | Free for Streamlit/Gradio | 🆓 / 💰 for GPU
Vercel | Frontend + serverless functions | 🆓 tier / 💰 $20+/month

Prices and free tiers change often, so check each platform's pricing page before you commit.

Example deploy to Railway:

# Install Railway CLI
npm install -g @railway/cli

# Login
railway login

# Init project
railway init

# Deploy
railway up

Option 2: Container-based (More Control)

Best for: small teams that need customization

# Dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# Build & run locally
docker build -t my-llm-app .
docker run -p 8000:8000 --env-file .env my-llm-app

# Tag the image, then push it to a registry (myregistry is a placeholder)
docker tag my-llm-app myregistry/my-llm-app:latest
docker push myregistry/my-llm-app:latest

Option 3: Cloud Provider (Enterprise)

Best for: large scale, strict compliance

  • AWS: ECS/EKS + Lambda + Bedrock
  • GCP: Cloud Run + Vertex AI
  • Azure: Container Apps + Azure OpenAI

For junior level: start with Option 1 (Railway/Render). Move up to Option 2 when you need more control. Option 3 can wait until you work at a company that runs on it.


Part 3: Backend API with FastAPI

Why FastAPI?

  • Async native (important because LLM calls are slow)
  • Auto-generates API docs (Swagger)
  • Type-safe with Pydantic
  • Popular in the ML/AI community

Production Project Structure

my-llm-app/
├── app/
│   ├── main.py              # FastAPI app
│   ├── routers/
│   │   ├── chat.py          # Chat endpoints
│   │   └── health.py        # Health check
│   ├── services/
│   │   ├── llm.py           # LLM interaction
│   │   ├── retriever.py     # RAG retrieval
│   │   └── guardrails.py    # Safety checks
│   ├── models/
│   │   └── schemas.py       # Request/Response models
│   └── config.py            # Settings
├── tests/
├── Dockerfile
├── requirements.txt
└── .env

Example Implementation

# app/main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from app.routers import chat, health

app = FastAPI(title="My LLM App", version="1.0.0")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # fine for a demo; list your frontend origins in production
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(health.router)
app.include_router(chat.router, prefix="/api")

# app/routers/chat.py
from fastapi import APIRouter, HTTPException
from app.models.schemas import ChatRequest, ChatResponse
from app.services.llm import get_answer
from app.services.guardrails import check_input, check_output

router = APIRouter()

@router.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    # Input guardrail
    input_check = check_input(request.message)
    if not input_check.safe:
        return ChatResponse(
            answer="Maaf, saya tidak bisa memproses permintaan tersebut.",
            sources=[],
        )
    
    # Get answer from RAG
    result = await get_answer(request.message)
    
    # Output guardrail
    output_check = check_output(result.answer)
    if not output_check.safe:
        return ChatResponse(
            answer="Maaf, terjadi masalah. Silakan coba pertanyaan lain.",
            sources=[],
        )
    
    return ChatResponse(
        answer=result.answer,
        sources=result.sources,
    )

# app/models/schemas.py
from pydantic import BaseModel

class ChatRequest(BaseModel):
    message: str
    session_id: str | None = None

class ChatResponse(BaseModel):
    answer: str
    sources: list[str]

Part 4: Caching (Save Cost & Latency)

Why Does Caching Matter?

  • An LLM API call is slow (1-10 seconds) and expensive ($0.01-0.10 per query)
  • Many users ask the same or similar things
  • A cache can save 50-80% of cost and around 90% of latency on repeated queries; the overall saving depends on your hit rate

Types of Cache

How to Read the Diagram:

  • Purple, left = user query.
  • Cyan = 3 paths: exact cache, semantic cache, miss path.
  • Emerald = cache hits (fast, free).
  • Pink = cache miss → LLM call (slow, expensive).
  • Amber = store in the cache for next time.
  • Emerald, right = final response.

Step-by-Step Walkthrough:

  1. A query arrives. Hash the query (lowercase, trim) → check Redis.
  2. Exact hit: same hash → return the cached answer in close to 0 ms (a Redis lookup typically takes about a millisecond), $0.
  3. Exact miss → semantic check: embed the query → look for a cached query with similarity > 0.9.
  4. Semantic hit: a similar query exists → return its earlier answer, ~50 ms, $0.
  5. Full miss: nothing in the cache → call the LLM (2-5 s, $0.03 per query).
  6. After the LLM answers, store the result in the cache (key = hash, value = answer, TTL = 1 hour).
  7. Respond to the user. The next similar query → cache hit.

Everyday Analogy: Think of the FAQ at a customer service desk. An identical question (exact) is answered straight from the script. A similar question (semantic) gets matched to a related FAQ entry. A new question means consulting a senior colleague (LLM call), and the answer is then written down for the FAQ next time.

Static Mermaid diagram as a fallback:

flowchart TD
    Q["User Query"] --> C{"Cache Hit?"}
    C -->|Exact match| R1["✅ Return cached<br/>(0ms, $0)"]
    C -->|Semantic match| R2["✅ Return similar cached<br/>(50ms, $0)"]
    C -->|Miss| L["🤖 Call LLM<br/>(2-5s, $0.03)"]
    L --> S["💾 Store in cache"]
    S --> R3["Return fresh answer"]

1. Exact Cache (Redis)

import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)

def _cache_key(question: str) -> str:
    # Normalize (trim + lowercase) so trivial variations share one key.
    # MD5 is used only as a fast fingerprint here, not for security.
    digest = hashlib.md5(question.strip().lower().encode()).hexdigest()
    return f"llm:{digest}"

def get_cached_answer(question: str) -> str | None:
    cached = cache.get(_cache_key(question))
    if cached:
        return json.loads(cached)["answer"]
    return None

def cache_answer(question: str, answer: str, ttl: int = 3600) -> None:
    # setex stores the value with an expiry (TTL in seconds; default 1 hour)
    cache.setex(_cache_key(question), ttl, json.dumps({"answer": answer}))

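2. Semantic Cache (GPTCache)

Different wording, same meaning → return the cached answer:

  • "Berapa harga BPJS kelas 1?" ≈ "Iuran BPJS kelas satu berapa ya?" (both ask what the BPJS class 1 premium is)

from gptcache import cache
from gptcache.config import Config
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Setup semantic cache: ONNX embeddings, SQLite for answers, FAISS for vectors
onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension),
)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    # Threshold: a similarity score of 0.9 or higher counts as a cache hit
    config=Config(similarity_threshold=0.9),
)
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment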

Part 5: Monitoring & Observability

What Should You Monitor?

Metric | Why It Matters | Target
Latency | User experience | p95 < 3 seconds
Error rate | Reliability | < 1%
Token usage | Cost | Track per user/endpoint
Hallucination rate | Quality | < 5% (from eval)
User satisfaction | Business value | Thumbs up > 80%
Cache hit rate | Efficiency | > 30%

Monitoring Tools
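
To make the targets in the table concrete, here is a small sketch that computes three of the metrics from logged request records and compares them against those targets. The record fields (`latency_ms`, `error`, `cache_hit`) and the function names are assumptions for illustration; the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math

def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]

def summarize(records: list[dict]) -> dict:
    """Aggregate per-request records into the metrics from the table."""
    total = len(records)
    return {
        "p95_latency_ms": p95([r["latency_ms"] for r in records]),
        "error_rate": sum(bool(r["error"]) for r in records) / total,
        "cache_hit_rate": sum(bool(r["cache_hit"]) for r in records) / total,
    }

def meets_targets(summary: dict) -> dict:
    """Compare a summary against the targets listed in the table."""
    return {
        "latency": summary["p95_latency_ms"] < 3000,     # p95 < 3 seconds
        "error_rate": summary["error_rate"] < 0.01,      # < 1%
        "cache_hit_rate": summary["cache_hit_rate"] > 0.30,
    }
```

Tracking p95 instead of the average matters because a handful of very slow LLM calls can hide behind a healthy-looking mean.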

flowchart LR
    App["LLM App"] --> LS["LangSmith<br/>(tracing)"]
    App --> LF["LangFuse<br/>(observability)"]
    App --> P["Prometheus<br/>(metrics)"]
    P --> G["Grafana<br/>(dashboard)"]
    App --> S["Sentry<br/>(errors)"]

# .env
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_key
LANGCHAIN_PROJECT=my-rag-chatbot

# Automatically traces all LangChain calls!
# Open smith.langchain.com to see:
# - Setiap step di chain
# - Input/output per step
# - Latency per step
# - Token usage
# - Error traces

Custom Logging

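import json
import logging
import time
from dataclasses import asdict, dataclass
from datetime import datetime

# Module paths follow the project layout above. get_cached_answer and
# cache_answer are the Redis helpers from the Exact Cache section.
from app.models.schemas import ChatResponse
from app.services.llm import get_answer

logger = logging.getLogger(__name__)

@dataclass
class QueryMetrics:
    question: str
    answer: str
    latency_ms: float
    tokens_used: int
    cache_hit: bool
    sources_count: int
    timestamp: str

def log_metrics(metrics: QueryMetrics) -> None:
    # One JSON object per line, so a dashboard or log shipper can parse it later
    logger.info(json.dumps(asdict(metrics)))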

async def tracked_query(question: str) -> ChatResponse:
    start = time.time()
    
    # Check cache
    cached = get_cached_answer(question)
    if cached:
        latency = (time.time() - start) * 1000
        log_metrics(QueryMetrics(
            question=question, answer=cached,
            latency_ms=latency, tokens_used=0,
            cache_hit=True, sources_count=0,
            timestamp=datetime.now().isoformat(),
        ))
        return ChatResponse(answer=cached, sources=[])
    
    # Call LLM
    result = await get_answer(question)
    latency = (time.time() - start) * 1000
    
    log_metrics(QueryMetrics(
        question=question, answer=result.answer,
        latency_ms=latency, tokens_used=result.tokens,
        cache_hit=False, sources_count=len(result.sources),
        timestamp=datetime.now().isoformat(),
    ))
    
    # Cache result
    cache_answer(question, result.answer)
    
    return ChatResponse(answer=result.answer, sources=result.sources)

Part 6: Cost Management

How Much Does an LLM Cost in Production?

Model | Input (per 1M tokens) | Output (per 1M tokens) | Typical query cost
GPT-4o | $2.50 | $10.00 | ~$0.01-0.05
GPT-4o-mini | $0.15 | $0.60 | ~$0.001-0.005
Claude Sonnet | $3.00 | $15.00 | ~$0.01-0.08
Claude Haiku | $0.25 | $1.25 | ~$0.001-0.005
Gemini Flash | $0.075 | $0.30 | ~$0.0005-0.002

List prices change often, so check each provider's pricing page for current numbers.

Example calculation:

  • 1,000 queries/day × $0.03/query = $30/day = $900/month
  • With a cache (50% hit rate): $450/month
  • With model routing on top (simple → cheap, complex → expensive): around $200/month, depending on the share of simple queries

Cost-Saving Strategies
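
The arithmetic behind the bullets above can be written out as a small calculator. The first two figures follow directly from the text; the routing figure is reproduced only under assumptions that are not in the original, namely that 60% of the remaining queries are simple and cost about $0.003 each on the cheap model. The token counts in the usage line are likewise illustrative.

```python
# USD per 1M tokens (input, output), copied from the price table above
PRICES_PER_1M = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one query, given its token counts."""
    price_in, price_out = PRICES_PER_1M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

def monthly_cost(
    queries_per_day: int,
    cost_per_query: float,
    cache_hit_rate: float = 0.0,
    cheap_share: float = 0.0,            # assumed share of queries routed to the cheap model
    cheap_cost_per_query: float = 0.0,   # assumed cost of a query on the cheap model
    days: int = 30,
) -> float:
    llm_calls = queries_per_day * (1 - cache_hit_rate)   # cache hits cost nothing
    blended = cheap_share * cheap_cost_per_query + (1 - cheap_share) * cost_per_query
    return llm_calls * blended * days

baseline = monthly_cost(1000, 0.03)                              # about 900
with_cache = monthly_cost(1000, 0.03, cache_hit_rate=0.5)        # about 450
with_routing = monthly_cost(1000, 0.03, 0.5, 0.6, 0.003)         # about 207
one_query = query_cost("gpt-4o", input_tokens=2000, output_tokens=500)  # about 0.01
```

Changing `cheap_share` shows how sensitive the routing saving is to the mix of simple and complex queries.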

1. Model Routing

def route_to_model(question: str) -> str:
    """Route simple questions to cheap model, complex to expensive."""
    # Heuristic: question length plus keywords that signal complexity
    # (Indonesian for "explain", "compare", "analyze")
    if len(question.split()) < 10 and not any(
        kw in question.lower() for kw in ["jelaskan", "bandingkan", "analisis"]
    ):
        return "gpt-4o-mini"  # Simple question → cheap model
    return "gpt-4o"  # Complex → powerful model

# Or use an LLM classifier
ROUTER_PROMPT = """Classify this question complexity:
- SIMPLE: factual, short answer expected
- COMPLEX: needs reasoning, comparison, analysis

Question: {question}
Classification:"""
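
One way the classifier prompt above could be wired into routing is sketched below. The `classify` argument is a hypothetical callable that sends a prompt to a cheap model and returns its raw text; it is injected so the sketch does not depend on any provider SDK. Falling back to the stronger model on an unexpected reply is a design assumption, not something the original specifies.

```python
from collections.abc import Callable

ROUTER_PROMPT = """Classify this question complexity:
- SIMPLE: factual, short answer expected
- COMPLEX: needs reasoning, comparison, analysis

Question: {question}
Classification:"""

MODEL_BY_LABEL = {"SIMPLE": "gpt-4o-mini", "COMPLEX": "gpt-4o"}

def parse_label(raw: str) -> str:
    """Map the classifier's raw reply to a known label."""
    text = raw.strip().upper()
    for label in MODEL_BY_LABEL:
        if text.startswith(label):
            return label
    return "COMPLEX"  # unknown reply: prefer quality over cost

def route_with_classifier(question: str, classify: Callable[[str], str]) -> str:
    """Ask the classifier for a label, then pick the matching model."""
    raw = classify(ROUTER_PROMPT.format(question=question))
    return MODEL_BY_LABEL[parse_label(raw)]
```

The classifier call itself costs tokens, so this approach tends to pay off only when the classifier runs on a model much cheaper than the one it helps avoid.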

2. Prompt Optimization

# BAD: long prompt, many wasted tokens
bad_prompt = """
Kamu adalah asisten AI yang sangat pintar dan membantu. Kamu selalu 
menjawab dengan sopan dan detail. Kamu harus memastikan jawabanmu 
akurat dan berdasarkan fakta. Jika kamu tidak tahu, bilang tidak tahu.
Berikut adalah konteks yang relevan untuk menjawab pertanyaan user:
{context}
Berdasarkan konteks di atas, jawab pertanyaan berikut dengan lengkap:
{question}
"""

# GOOD: concise; confirm on your eval set that quality holds
good_prompt = """Context: {context}

Q: {question}
A (based on context only):"""

3. Token Budgeting

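from fastapi import HTTPException

MAX_MONTHLY_BUDGET = 100  # USD
daily_budget = MAX_MONTHLY_BUDGET / 30
current_daily_spend = get_today_spend()  # defined elsewhere: today's spend in USD

model = "gpt-4o"  # default model

# Check the hard limit first; otherwise the 80% branch would always win
if current_daily_spend > daily_budget:
    # Rate limit or queue
    raise HTTPException(429, "Daily budget exceeded. Try again tomorrow.")
elif current_daily_spend > daily_budget * 0.8:
    # Switch to cheaper model
    model = "gpt-4o-mini"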

4. Chunking Strategy (RAG)

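from langchain_text_splitters import RecursiveCharacterTextSplitter

# Smaller chunks = fewer tokens in context = cheaper
# But too small = missing context

# Sweet spot: 300-500 tokens per chunk, overlap 50.
# Note: chunk_size counts characters by default; from_tiktoken_encoder
# makes it count tokens instead (requires the tiktoken package).
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,
    chunk_overlap=50,
)

# Retrieve fewer but more relevant chunks
# (vectorstore is the vector store built in the earlier RAG material)
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 3}  # 3 chunks, not 10
)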

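Part 7: Scaling & Reliability

Rate Limiting

from fastapi import APIRouter, Request
from slowapi import Limiter
from slowapi.util import get_remote_address

from app.models.schemas import ChatRequest

limiter = Limiter(key_func=get_remote_address)
router = APIRouter()

# In app/main.py, register the limiter so slowapi can answer with HTTP 429:
#   app.state.limiter = limiter
#   app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@router.post("/chat")
@limiter.limit("10/minute")  # Max 10 queries per minute per IP
async def chat(request: Request, body: ChatRequest):
    ...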

Fallback Strategy

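# call_llm, RateLimitError, and APIError are placeholders for your provider
# wrapper and its SDK exceptions; the model names are illustrative.
async def get_llm_response(prompt: str) -> str: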
    providers = [
        ("openai", "gpt-4o"),
        ("anthropic", "claude-sonnet"),
        ("openai", "gpt-4o-mini"),  # Fallback to cheaper
    ]
    
    for provider, model in providers:
        try:
            return await call_llm(provider, model, prompt)
        except (RateLimitError, TimeoutError, APIError) as e:
            logger.warning(f"{provider}/{model} failed: {e}")
            continue
    
    return "Maaf, layanan sedang tidak tersedia. Silakan coba lagi nanti."
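
The fallback loop above depends on provider SDKs, so here is a self-contained variant that can be run as-is. Everything in it is a stand-in: `ProviderError` represents whatever exceptions the real SDKs raise, and `flaky`, `slow`, and `healthy` are fake providers. Unlike the original, this sketch adds a per-provider timeout and raises when every provider fails, leaving the user-facing apology to the caller.

```python
import asyncio

class ProviderError(Exception):
    """Stand-in for a provider SDK error (rate limit, API error, ...)."""

async def call_with_fallback(prompt, providers, timeout_s: float = 10.0):
    """Try each (name, async callable) in order; return (name, answer)."""
    errors = []
    for name, call in providers:
        try:
            answer = await asyncio.wait_for(call(prompt), timeout=timeout_s)
            return name, answer
        except (ProviderError, asyncio.TimeoutError) as exc:
            errors.append(f"{name}: {type(exc).__name__}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Fake providers for the demonstration
async def flaky(prompt: str) -> str:
    raise ProviderError("rate limited")

async def slow(prompt: str) -> str:
    await asyncio.sleep(1)            # longer than the timeout used below
    return "late"

async def healthy(prompt: str) -> str:
    return f"ok: {prompt}"

result = asyncio.run(
    call_with_fallback(
        "hi",
        [("primary", flaky), ("secondary", slow), ("backup", healthy)],
        timeout_s=0.05,
    )
)
```

Here the first provider raises, the second times out, and the third answers, so `result` names the backup provider.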

Health Check

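# app/routers/health.py
from fastapi import APIRouter

router = APIRouter()

# The check_* helpers are assumed to return True when the dependency responds
@router.get("/health")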
async def health():
    checks = {
        "llm": await check_llm_connection(),
        "vectordb": await check_vectordb_connection(),
        "cache": check_cache_connection(),
    }
    
    all_healthy = all(checks.values())
    return {
        "status": "healthy" if all_healthy else "degraded",
        "checks": checks,
    }

Part 8: CI/CD for LLM Apps

Pipeline

flowchart LR
    C["Code Push"] --> T["Unit Tests"]
    T --> E["Eval Tests<br/>(RAGAS)"]
    E --> B["Build Docker"]
    B --> S["Deploy Staging"]
    S --> M["Manual QA"]
    M --> P["Deploy Production"]
    P --> Mon["Monitor"]

GitHub Actions Example

# .github/workflows/deploy.yml
name: Deploy LLM App

on:
  push:
    branches: [main]

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install deps
        run: pip install -r requirements.txt
      
      - name: Run unit tests
        run: pytest tests/unit/
      
      - name: Run eval tests
        run: python eval_pipeline.py --threshold 0.8
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      
      - name: Install Railway CLI
        run: npm install -g @railway/cli

      - name: Deploy to Railway
        if: success()
        run: railway up --service my-llm-app
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}

Common Misconceptions

"Deploy = upload to a server, done" → Deployment is only the beginning. Monitoring, iterating, and maintaining are the ongoing work.

"LLM apps don't need unit tests" → They still do. Test the guardrails, the routing logic, and the API contracts. Eval tests come on top of unit tests.

"The most expensive model = the best results" → As a rule of thumb, a cheap model is good enough for around 80% of queries, and smart routing can save up to about 70% of cost.

"Caching is useless because every question is unique" → A semantic cache catches questions that are similar. A hit rate of 30-50% is realistic for FAQ-style traffic.

"Scaling = you need Kubernetes" → For fewer than 10k users/day, a PaaS (Railway/Render) is enough. Don't over-engineer.


Check Your Understanding

  • How does a demo architecture differ from a production LLM app architecture?
  • Name the 3 deployment options and when to use each
  • Why does caching matter? What is the difference between an exact and a semantic cache?
  • Name 5 metrics you should monitor in production
  • How can model routing save cost?
  • What is a fallback strategy and why does it matter?

Challenge 7B.3

Challenge 1: Deploy to the Cloud (Required)

Deploy the RAG chatbot from Phase 7 to Railway or Render. Make sure that:

  • It is reachable via a public URL
  • The health check endpoint works
  • Environment variables are kept safe (no hardcoded API keys)

Challenge 2: Add Caching (Medium)

Implement an exact cache with Redis (or an in-memory dict for a demo). Measure:

  • Latency with the cache vs without it
  • What % of queries hit the cache after 50 queries

Challenge 3: Monitoring Dashboard (Medium)

Add logging to every query. Build a simple dashboard (Streamlit is fine) that shows:

  • Total queries today
  • Average latency
  • Cache hit rate
  • Top 5 most frequent questions

Challenge 4: Cost Optimization (Hard)

Implement model routing: simple questions → GPT-4o-mini, complex → GPT-4o. Compare:

  • Total cost per 100 queries (before vs after routing)
  • Quality score (from eval): does it drop significantly?

Challenge 5: Full Production Stack (Very Hard)

Combine everything: FastAPI backend + guardrails + caching + monitoring + deploy. This can become your main portfolio project.


Closing Quote

"Everyone wants to build AI. Few want to operate AI."

Operational skills (deploy, monitor, cost) are what separate engineers who can ship from those who can only prototype. You now have both.


Congratulations! You have completed all of the preparation material. You are ready for the Dicoding bootcamp with a foundation far stronger than the average participant's.