07 — LLM API & Patterns
Estimated time: 6 hours
Goal: Use LLM APIs (OpenAI, Anthropic, Gemini) in production. These are skills you will apply directly in the bootcamp capstone.
Why Does This Material Matter?
Most GenAI products in 2026 do not train a model from scratch. They compose existing LLM APIs into business solutions. ChatGPT clones, code assistants, customer support bots, content generators: the large majority are smart wrappers on top of the OpenAI, Anthropic, or Gemini APIs. This material teaches skills you will use right away in the bootcamp capstone and in your first job as a GenAI engineer.
Imagine you are a building contractor. You do not need to make your own cement; you buy it from a factory (the LLM API). What matters is that you know how to mix it (prompt engineering), which cement to use when (model selection), how to keep costs down (caching), and how to keep the building standing during an earthquake (error handling, retry, rate limits). This file is the skilled contractor's guide from A to Z.
Three key skills you will master: (1) Provider basics: calling OpenAI/Anthropic/Gemini correctly, including streaming and multi-turn, (2) Production patterns: retry, caching, rate limiting, and cost tracking, which separate a demo from a real product, and (3) Function calling: the pattern that lets an LLM "interact with the outside world" (databases, APIs, file systems) and the foundation of AI agents.
Mental Map: The LLM API Call Flow
How to Read the Diagram:
- Left = the app, right = the response to the user
- Cache at the start = check for a hit before calling the API
- Rate limiter = protection against 429 errors
- Track cost + save cache at the end = a production-grade pattern
Step-by-Step Walkthrough:
- The app receives a request and builds the prompt (system + user message)
- Check the cache: on a hit, return immediately (saves cost and latency)
- On a miss, pass through the rate limiter to control throughput
- Client.create sends the request to the LLM API (OpenAI / Claude / Gemini)
- The response contains the output tokens plus usage info
- Track cost accumulates the cost for monitoring
- Save to cache for the next identical request
Everyday Analogy: App ↔ LLM API = phoning an expert consultant who bills per minute. Cache = your notes of answers the consultant has already given. Rate limiter = a rule not to call too often. Track cost = logging the bill so the end of the month holds no surprises.
Static Mermaid diagram as a fallback:
flowchart LR
App["💻 App"] --> Prep["📝 Build prompt<br/>(system + user)"]
Prep --> Cache{"🗄️ Cache?"}
Cache -->|"hit"| Ret["✅ Return cached"]
Cache -->|"miss"| Lim["⏱️ Rate limiter"]
Lim --> Client["🌐 Client.create"]
Client --> API["🤖 LLM API<br/>(OpenAI/Claude/Gemini)"]
API --> Tok["🔢 Tokens out"]
Tok --> Track["📊 Track cost"]
Track --> Save["💾 Save to cache"]
Save --> Ret
style App fill:#dbeafe
style API fill:#fed7aa
style Ret fill:#d1fae5
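The walkthrough above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: `fake_llm_call` is a hypothetical stand-in for a real provider call, the prices are assumed placeholder values in USD per 1M tokens, and the rate limiter is reduced to a minimum spacing between calls.

```python
import hashlib
import time


def fake_llm_call(prompt: str) -> dict:
    # Hypothetical stand-in for a real provider call: returns text plus token usage.
    return {"text": f"echo: {prompt}", "input_tokens": len(prompt.split()), "output_tokens": 3}


class Pipeline:
    def __init__(self, min_interval_s: float = 0.0, in_price: float = 1.0, out_price: float = 5.0):
        self.cache: dict[str, str] = {}
        self.min_interval_s = min_interval_s   # simplest possible rate limit: spacing between calls
        self.last_call = float("-inf")
        self.in_price = in_price               # assumed USD per 1M input tokens
        self.out_price = out_price             # assumed USD per 1M output tokens
        self.total_cost = 0.0
        self.api_calls = 0

    def ask(self, system: str, user: str) -> str:
        prompt = f"{system}\n\n{user}"                      # build the prompt
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.cache:                               # cache hit: return early
            return self.cache[key]
        wait = self.min_interval_s - (time.monotonic() - self.last_call)
        if wait > 0:                                        # rate limiter
            time.sleep(wait)
        self.last_call = time.monotonic()
        result = fake_llm_call(prompt)                      # call the API
        self.api_calls += 1
        self.total_cost += (result["input_tokens"] * self.in_price
                            + result["output_tokens"] * self.out_price) / 1_000_000  # track cost
        self.cache[key] = result["text"]                    # save to cache
        return result["text"]


pipeline = Pipeline()
first = pipeline.ask("You are a tutor.", "What is RAG?")
second = pipeline.ask("You are a tutor.", "What is RAG?")   # served from the cache
print(pipeline.api_calls, first == second)
```

The second call never reaches `fake_llm_call`, which is the point of putting the cache check before the rate limiter and the API call.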
Part 1: Choose a Provider
| Provider | Pros | Cons | Free Tier |
|---|---|---|---|
| OpenAI (GPT) | Industry standard | $5 minimum top-up | ❌ |
| Anthropic (Claude) | Strong reasoning, coding | Minimum top-up | ✅ Trial credit |
| Google (Gemini) | Multimodal, generous free tier | Occasional stability issues | ✅ Free tier |
| Open (Ollama, vLLM) | Free, private | Setup, infrastructure | ✅ Local |
For learning: the Gemini free tier is the cheapest (free) option. Sign up at aistudio.google.com.
Part 2: Set Up the API Key
Best Practice: Environment Variables
# .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AI...
# pip install python-dotenv
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.environ["GEMINI_API_KEY"]
Required .gitignore Entries
.env
*.env
**/secrets.json
NEVER commit an API key. Once it leaks, assume your wallet is exposed.
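One way to make a missing key fail early and to keep keys out of logs is sketched below. The helper names `require_env` and `mask` are hypothetical, introduced only for this illustration, and `DEMO_API_KEY` is a made-up variable name.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set. Add it to your .env file or shell environment.")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Show only the last few characters so the key is safe to print or log."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


os.environ["DEMO_API_KEY"] = "sk-test-1234"   # demo value only, never hard-code real keys
print(mask(require_env("DEMO_API_KEY")))
```

Failing at startup with a named variable is usually easier to debug than an authentication error from the provider several calls later.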
Part 3: OpenAI API
LLM API analogy: imagine phoning an expert consultant. You give context (system message), ask a question (user message), and wait for the answer (response). You pay per minute on the phone (per token). The consultant has thousands of hours of training behind them. The nice part: all you need is an internet connection, an API key, and about 10 lines of code.
Visualization: Anatomy of a Single API Call
flowchart TD
Sys["🎭 System message<br/>'You are an AI tutor'"] --> Msg["📨 messages array"]
User["👤 User message<br/>'What is a transformer?'"] --> Msg
Msg --> Req["📤 POST request"]
Key["🔑 API Key"] --> Req
Params["⚙️ Params:<br/>temperature, max_tokens"] --> Req
Req --> Net["🌐 HTTPS"]
Net --> Server["🖥️ OpenAI server"]
Server --> Inf["🧠 GPT inference"]
Inf --> Resp["📥 Response<br/>(content + usage)"]
Resp --> App["💻 Your app"]
style Sys fill:#fef3c7
style User fill:#dbeafe
style Server fill:#fed7aa
style App fill:#d1fae5
pip install openai
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Simple completion
response = client.chat.completions.create(
model="gpt-4o-mini", # cheap, fast
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Apa itu machine learning?"},
],
temperature=0.7,
max_tokens=500,
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.total_tokens}")
Streaming
stream = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Tulis cerita pendek"}],
stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Structured Output
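response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # JSON mode requires the prompt itself to ask for JSON
        {"role": "user", "content": "Extract name, age, and hobby as JSON: 'Budi 25 tahun, suka coding'"}
    ],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)  # a JSON string; parse it with json.loads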
Part 4: Anthropic API (Claude)
pip install anthropic
from anthropic import Anthropic
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
response = client.messages.create(
model="claude-haiku-4-5-20251001", # cheap & fast
max_tokens=500,
messages=[
{"role": "user", "content": "Apa itu RAG?"}
],
system="You are a helpful AI tutor.",
)
print(response.content[0].text)
Prompt Caching (Save up to 90% on Long Context)
# Cache long instructions/context that are reused across calls
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": "You are a helpful AI tutor specialized in...[long instructions]",
            "cache_control": {"type": "ephemeral"},  # cached for 5 minutes
        }
    ],
    messages=[{"role": "user", "content": "Apa itu RAG?"}],
)
# Inspect cache hits in response.usage
print(response.usage.cache_read_input_tokens)  # tokens read from cache (90% discount)
print(response.usage.cache_creation_input_tokens)  # tokens written to cache on the first call (25% premium)
When should you use prompt caching? When the system prompt is long (the minimum cacheable length depends on the model, starting at 1024 tokens; check the provider docs) and it is reused at least twice within 5 minutes. The cached portion then costs up to 90% less in production.
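The break-even point follows from the two multipliers quoted above (a 25% premium to write the cache, a 90% discount to read it). The sketch below works through the arithmetic for the cached prefix only. It assumes every call after the first lands inside the cache lifetime, and it ignores the non-cached part of the prompt and the output tokens.

```python
def relative_cost(n_calls: int, use_cache: bool,
                  write_multiplier: float = 1.25, read_multiplier: float = 0.10) -> float:
    """Cost of the shared prefix over n_calls, in units of one uncached read of that prefix."""
    if n_calls <= 0:
        return 0.0
    if not use_cache:
        return float(n_calls)
    # The first call writes the cache; later calls inside the cache lifetime read it.
    return write_multiplier + read_multiplier * (n_calls - 1)


for n in (1, 2, 10):
    print(n, relative_cost(n, use_cache=False), round(relative_cost(n, use_cache=True), 2))
```

With a single call, caching costs more (1.25 versus 1.0). From the second call it is already cheaper (1.35 versus 2.0), and at ten calls the prefix costs 2.15 instead of 10. The saving approaches 90% only as the number of reuses grows.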
Multi-turn
messages = [
{"role": "user", "content": "Halo"},
{"role": "assistant", "content": "Halo! Apa yang bisa saya bantu?"},
{"role": "user", "content": "Jelaskan transformer"},
]
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=1000,
messages=messages,
)
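Part 5: Gemini API
pip install google-generativeai
import google.generativeai as genai
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
# Model names are retired over time, and Google also ships a newer google-genai SDK;
# check the current docs before copying this model name or package.
model = genai.GenerativeModel("gemini-1.5-flash")  # free tier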
response = model.generate_content("Apa itu deep learning?")
print(response.text)
Chat (Multi-turn)
chat = model.start_chat()
response = chat.send_message("Halo, perkenalkan diri")
print(response.text)
response = chat.send_message("Apa rekomendasi belajar AI?")
print(response.text)
# History
print(chat.history)
Multimodal (Image)
import PIL.Image
img = PIL.Image.open("photo.jpg")
response = model.generate_content(["Apa yang kamu lihat di gambar ini?", img])
print(response.text)
Part 6: Common Patterns
Pattern 1: Wrapper Class
import os
import google.generativeai as genai
class LLMClient:
    def __init__(self, provider="gemini"):
        self.provider = provider
        if provider == "gemini":
            genai.configure(api_key=os.environ["GEMINI_API_KEY"])
            self.client = genai.GenerativeModel("gemini-1.5-flash")
    def generate(self, prompt: str, **kwargs) -> str:
        if self.provider == "gemini":
            response = self.client.generate_content(prompt)
            return response.text
        # add other providers here
        raise NotImplementedError(f"provider not supported yet: {self.provider}")
    def chat(self, messages: list[dict]) -> str:
        # Left as an exercise
        raise NotImplementedError
Pattern 2: Retry with Exponential Backoff
import time
from functools import wraps
def retry_with_backoff(max_attempts=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:  # retries on every error, including non-transient ones
                    if attempt == max_attempts - 1:
                        raise
                    delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
                    print(f"Retry in {delay}s: {e}")
                    time.sleep(delay)
        return wrapper
    return decorator
@retry_with_backoff(max_attempts=3)
def call_llm(prompt):
    # `model` is the Gemini GenerativeModel created in Part 5
    return model.generate_content(prompt).text
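The decorator above retries on any exception, which also repeats requests that can never succeed (for example a malformed request). A possible refinement, sketched below under assumptions, retries only on a chosen set of exception types and adds random jitter so that many clients do not retry in lockstep. `TransientError` is a hypothetical marker class. In real code you would pass the provider SDK's rate-limit and timeout exception types through `retry_on`.

```python
import random
import time
from functools import wraps


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, 429, 5xx)."""


def retry_transient(max_attempts=3, base_delay=1.0, max_delay=30.0,
                    retry_on=(TransientError,), sleep=time.sleep):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts - 1:
                        raise
                    # Full jitter: wait a random time up to the capped exponential delay.
                    cap = min(max_delay, base_delay * (2 ** attempt))
                    sleep(random.uniform(0, cap))
        return wrapper
    return decorator


attempts = {"n": 0}


@retry_transient(max_attempts=4, base_delay=0.01)
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("server busy")
    return "ok"


print(flaky_call(), attempts["n"])
```

The `sleep` parameter exists only so the waiting can be replaced in tests. Exceptions outside `retry_on` propagate on the first attempt.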
Pattern 3: Cost Tracking
class TokenTracker:
    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0
    def add(self, input_t, output_t):
        self.input_tokens += input_t
        self.output_tokens += output_t
    def cost_estimate(self, in_price=0.0001, out_price=0.0003):
        # Prices are in USD per 1,000 tokens, hence the division by 1000
        return (self.input_tokens * in_price + self.output_tokens * out_price) / 1000
tracker = TokenTracker()
# After each call (field names below follow the OpenAI response format)
tracker.add(response.usage.prompt_tokens, response.usage.completion_tokens)
print(f"Cost so far: ${tracker.cost_estimate():.4f}")
Pattern 4: Caching
import json
import hashlib
from pathlib import Path
cache_dir = Path("./llm_cache")
cache_dir.mkdir(exist_ok=True)
def cache_key(prompt: str) -> str:
    # The hash is only a file name here, not a security measure
    return hashlib.md5(prompt.encode()).hexdigest()
def cached_generate(prompt: str) -> str:
    key = cache_key(prompt)
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["response"]
    # `model` is the Gemini GenerativeModel created in Part 5
    response = model.generate_content(prompt).text
    cache_file.write_text(json.dumps({"prompt": prompt, "response": response}))
    return response
This saves money during development.
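The cache key above depends only on the prompt text, so a cached answer would also be returned after switching models or changing the temperature. A possible extension, sketched here with a hypothetical helper named `request_cache_key`, hashes the whole request instead.

```python
import hashlib
import json


def request_cache_key(model: str, messages: list[dict], **params) -> str:
    """Build a stable cache key from everything that influences the response."""
    payload = {"model": model, "messages": messages, "params": params}
    # sort_keys makes the serialization independent of keyword-argument order.
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


msgs = [{"role": "user", "content": "Apa itu RAG?"}]
key_a = request_cache_key("gpt-4o-mini", msgs, temperature=0.0, max_tokens=200)
key_b = request_cache_key("gpt-4o-mini", msgs, max_tokens=200, temperature=0.0)
key_c = request_cache_key("gpt-4o-mini", msgs, temperature=0.7, max_tokens=200)
print(key_a == key_b, key_a == key_c)
```

Caching makes the most sense for deterministic settings such as a temperature of 0. With a higher temperature, a cache hit silently removes the variety the caller asked for.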
Pattern 5: Function Calling / Tool Use
Function calling analogy: a standalone LLM is a consultant whose knowledge stops at the training cutoff (for example, 2024). Function calling hands that consultant a phone to reach the outside world: check real-time weather, read a database, send an email. The LLM does not execute the function itself. It only says "call function X with arguments Y", and your code does the execution. You then send the result back to the LLM so it can phrase a natural-language answer.
Function Calling Flow
flowchart LR
User["👤 'Weather in Jakarta?'"] --> LLM1["🤖 LLM"]
Tools["🛠️ Tools schema<br/>get_weather(city)"] --> LLM1
LLM1 --> Decide{"Needs a tool?"}
Decide -->|"yes"| Call["📞 tool_call:<br/>get_weather('Jakarta')"]
Call --> App["💻 Your code<br/>execute function"]
App --> Result["📊 Result:<br/>'30°C, sunny'"]
Result --> LLM2["🤖 LLM (round 2)"]
LLM2 --> Final["✅ 'Jakarta is 30°C and sunny today'"]
Decide -->|"no"| Final
style LLM1 fill:#fed7aa
style LLM2 fill:#fed7aa
style Final fill:#d1fae5
# OpenAI style
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"}
},
"required": ["city"]
}
}
}
]
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Cuaca Jakarta?"}],
tools=tools,
)
# Check whether the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    # Run the function (get_weather is your own implementation, not shown here)
    if function_name == "get_weather":
        result = get_weather(**args)
    # Send the result back to the model
    # ... (multi-turn)
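The "send the result back" step can be sketched without a network call. The example below assumes the OpenAI Chat Completions convention, where a tool result goes back as a message with role `tool` and the matching `tool_call_id`. `get_weather` is a hypothetical stub with made-up data, and `TOOL_REGISTRY` and `run_tool_call` are illustrative names, not part of any SDK.

```python
import json


def get_weather(city: str) -> dict:
    # Hypothetical stub; a real version would query a weather service.
    return {"city": city, "temp_c": 30, "condition": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str, tool_call_id: str) -> dict:
    """Execute one requested tool call and build the message that reports the result."""
    try:
        func = TOOL_REGISTRY[name]
        result = func(**json.loads(arguments_json))
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Report the failure to the model instead of crashing the app.
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)}


# Simulated values, shaped like tool_call.function.name / .arguments / tool_call.id
tool_message = run_tool_call("get_weather", '{"city": "Jakarta"}', "call_demo_1")
print(tool_message)
```

In a real round trip you would append the assistant message that contains the tool calls, then this tool message, and call the API again with the extended `messages` list so the model can write the final answer.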
Part 7: Async for Throughput
For large batch jobs:
import asyncio
import os
from openai import AsyncOpenAI
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
async def call_one(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
async def main():
    topics = ["RAG", "embeddings", "fine-tuning"]  # example topics
    prompts = [f"What is {topic}?" for topic in topics]
    results = await asyncio.gather(*[call_one(p) for p in prompts])
    return results
results = asyncio.run(main())
Running requests concurrently can speed up batch processing by 10-50x compared with a sequential loop. The actual gain depends on your rate limits and per-request latency.
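By default, `asyncio.gather` raises the first exception it sees, and the results of the other requests are lost to the caller. A possible way to keep the successful results is `return_exceptions=True`, sketched below. `fake_call` is a hypothetical stand-in for an async LLM request, and the sketch assumes the prompts are unique because they are used as dictionary keys.

```python
import asyncio


async def fake_call(prompt: str) -> str:
    # Hypothetical stand-in for an async LLM request.
    await asyncio.sleep(0.01)
    if "fail" in prompt:
        raise RuntimeError(f"provider error for {prompt!r}")
    return prompt.upper()


async def run_batch(prompts: list[str]) -> tuple[dict, dict]:
    # gather preserves input order, so outcomes line up with prompts.
    outcomes = await asyncio.gather(*(fake_call(p) for p in prompts), return_exceptions=True)
    ok, failed = {}, {}
    for prompt, outcome in zip(prompts, outcomes):
        if isinstance(outcome, Exception):
            failed[prompt] = outcome
        else:
            ok[prompt] = outcome
    return ok, failed


ok, failed = asyncio.run(run_batch(["a", "fail-b", "c"]))
print(ok, list(failed))
```

The failed prompts can then be retried on their own, without paying again for the ones that already succeeded.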
Part 8: Production Considerations
Rate Limit
APIs enforce limits (e.g., 60 req/min). Implement throttling:
import time
from collections import deque
class RateLimiter:
    def __init__(self, max_calls, period_seconds):
        self.max_calls = max_calls
        self.period = period_seconds
        self.calls = deque()
    def wait_if_needed(self):
        now = time.time()
        # Drop calls that are older than the period
        while self.calls and now - self.calls[0] > self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            sleep_time = self.period - (now - self.calls[0])
            time.sleep(sleep_time)
            now = time.time()  # refresh after sleeping so the recorded timestamp is accurate
            self.calls.popleft()  # the oldest call has now left the window
        self.calls.append(now)
limiter = RateLimiter(max_calls=60, period_seconds=60)
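Timing code is awkward to test because it really waits. One possible variant of the sliding-window limiter, sketched below, takes the clock and the sleep function as parameters so a test can substitute fake ones. The class name `SlidingWindowLimiter` and the `clock` and `sleep` parameters are assumptions made for this illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    def __init__(self, max_calls, period_seconds, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period_seconds
        self.clock = clock      # injectable so tests can control time
        self.sleep = sleep
        self.calls = deque()

    def _drop_expired(self, now):
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()

    def acquire(self):
        now = self.clock()
        self._drop_expired(now)
        while len(self.calls) >= self.max_calls:
            # Wait until the oldest recorded call leaves the window, then re-check.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._drop_expired(now)
        self.calls.append(now)


limiter = SlidingWindowLimiter(max_calls=2, period_seconds=0.05)
start = time.monotonic()
for _ in range(3):
    limiter.acquire()           # the third call has to wait for the window to slide
elapsed = time.monotonic() - start
print(f"three calls took {elapsed:.3f}s")
```

Call `acquire()` immediately before each API request. The limiter is not thread-safe as written, so concurrent use would need a lock.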
Error Handling
from openai import APIError, RateLimitError
try:
    response = client.chat.completions.create(...)
except RateLimitError:
    # Wait + retry (must come before APIError, which is its parent class)
    pass
except APIError as e:
    # Log + show a user-friendly message
    pass
except Exception as e:
    # Anything else
    pass
Logging
import logging
logger = logging.getLogger(__name__)
def call_llm(prompt):
    logger.info(f"LLM call - prompt length: {len(prompt)}")
    # `model` is the Gemini GenerativeModel created in Part 5
    response = model.generate_content(prompt)
    logger.info(f"LLM response - tokens: {response.usage_metadata}")
    return response.text
Monitoring
- Latency — p50, p95, p99
- Token usage — per request, daily total
- Error rate — by error type
- Cost — daily, per user
Tools: Helicone, LangSmith, or a custom dashboard.
Part 9: Common Mistakes & FAQ
1. API Key in Code (DANGEROUS)
# ❌ NEVER do this
client = OpenAI(api_key="sk-abc123...")  # commit = leaked = wallet exposed
# ✅ Use an environment variable
import os
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
First aid after a leak: revoke the key immediately in the provider dashboard. Generate a new one. Audit usage for the last 24 hours.
2. Not Setting max_tokens
# ❌ Where the cap is optional (e.g. OpenAI), the model may generate thousands of tokens = wasted money.
# Anthropic's Messages API requires max_tokens, so this particular call fails outright.
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    messages=[{"role": "user", "content": "Tulis cerita"}],
    # max_tokens missing!
)
# ✅ Always set a limit
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=500,  # cap output
    messages=[...],
)
3. Forgetting to Handle Rate Limits
# ❌ A loop without rate limiting → 429 errors after a few requests
for prompt in prompts:  # prompts: e.g. a list of 1000 prompt strings
    response = client.messages.create(...)
# ✅ Use retry with exponential backoff
@retry_with_backoff(max_attempts=5, base_delay=2)
def safe_call(prompt):
    return client.messages.create(...)
4. Streaming Without Flush
# ❌ Output may appear all at once at the end (stdout is buffered)
for chunk in stream:
    print(chunk.choices[0].delta.content, end="")
# ✅ Force a flush for real-time output; `or ""` skips chunks whose content is None
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
5. Multi-turn Without History Management
# ❌ Context grows without bound → expensive, and eventually hits the context limit
messages = []
while True:
    user = input()
    messages.append({"role": "user", "content": user})
    response = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=500, messages=messages
    )
    messages.append({"role": "assistant", "content": response.content[0].text})
    # messages grows without bound!
# ✅ Truncate / summarize once history exceeds N turns
MAX_TURNS = 10
if len(messages) > MAX_TURNS * 2:
    # Keep the last N turns. With Anthropic the system prompt is passed
    # separately via system=, so it is not part of this list.
    messages = messages[-MAX_TURNS * 2:]
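A reusable version of the truncation step might look like the sketch below. It assumes OpenAI-style message dicts, where an optional system message sits at index 0. With Anthropic, the list contains only user and assistant messages and the function still applies. The name `trim_history` is hypothetical.

```python
def trim_history(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep a leading system message (if any) plus the most recent max_turns exchanges."""
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    recent = rest[-max_turns * 2:] if max_turns > 0 else []
    # Drop leading assistant messages so the window starts on a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


history = [{"role": "system", "content": "You are a tutor."}]
for i in range(6):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=2)
print([m["content"] for m in trimmed])
```

Truncation forgets older turns entirely. Summarizing the dropped turns into one short message is the usual next step when earlier context still matters.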
6. JSON Output Not Validated
# ❌ Assuming the LLM always returns valid JSON
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[...],
)
data = json.loads(response.choices[0].message.content)  # can raise
# ✅ Try/except + retry
try:
    data = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    # Retry, or use structured output (Pydantic)
    pass
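The retry branch can be filled in along the following lines. This is a sketch under assumptions: `generate` is any callable that maps a prompt string to a reply string (a hypothetical wrapper around your provider call), and the validation is a plain required-keys check, not a full schema.

```python
import json


def parse_json_reply(text: str, required_keys: set[str]) -> dict:
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def generate_json(generate, prompt: str, required_keys: set[str], max_attempts: int = 3) -> dict:
    last_error = None
    for _ in range(max_attempts):
        reply = generate(prompt)
        try:
            return parse_json_reply(reply, required_keys)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
            # Tell the model what went wrong before asking again.
            prompt = f"{prompt}\n\nYour previous reply was invalid ({exc}). Return only a JSON object."
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error


# Simulated model replies: broken, incomplete, then valid.
replies = iter(["not json", '{"name": "Budi"}', '{"name": "Budi", "age": 25}'])
result = generate_json(lambda p: next(replies), "Extract name and age as JSON.", {"name", "age"})
print(result)
```

Feeding the error message back into the prompt gives the model a chance to correct itself instead of repeating the same mistake.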
7. Not Tracking Cost During Development
# Without tracking, the end-of-month bill is a surprise ($50 in 1 day = real story)
# ✅ Track every call
total_cost = 0
def call_with_tracking(prompt):
    global total_cost
    response = client.messages.create(...)
    # Prices here are per 1K tokens ($0.001 in, $0.005 out), hence the / 1000
    cost = (response.usage.input_tokens * 0.001 +
            response.usage.output_tokens * 0.005) / 1000
    total_cost += cost
    print(f"Call: ${cost:.4f}, Total: ${total_cost:.2f}")
    return response
8. Mismatched Model Names (Often Outdated)
# ❌ Old model name / typo
model="gpt-4-turbo-preview"  # may already be deprecated
model="claude-3-opus"  # missing the version date
# ✅ Check the latest docs, use the full versioned name
model="gpt-4o-mini"  # OpenAI 2026
model="claude-haiku-4-5-20251001"  # Anthropic, with date
9. Async Without a Concurrency Limit
# prompts: a list of 1000 prompt strings
# ❌ asyncio.gather with 1000 parallel requests → rate limits + memory blow-up
results = await asyncio.gather(*[call_one(p) for p in prompts])
# ✅ Use a semaphore
sem = asyncio.Semaphore(10)  # max 10 concurrent
async def bounded_call(prompt):
    async with sem:
        return await call_one(prompt)
results = await asyncio.gather(*[bounded_call(p) for p in prompts])
Part 10: Pricing Cheat Sheet (Per 1M Tokens)
2026 prices; they can change. Check the provider docs for current numbers.
| Model | Input | Output | Context | Best for |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128k | Premium tasks |
| GPT-4o-mini | $0.15 | $0.60 | 128k | Most tasks |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 200k | Reasoning, coding |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200k | Most tasks, cheap |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M | Bulk, multimodal |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M | Long context |
Cost Estimation
def estimate_cost(input_tokens, output_tokens, model="haiku"):
    # (input, output) prices in USD per 1M tokens, matching the table above
    rates = {
        "gpt-4o-mini": (0.15, 0.60),
        "gpt-4o": (2.50, 10.0),
        "haiku": (1.00, 5.00),
        "sonnet": (3.00, 15.0),
        "gemini-flash": (0.075, 0.30),
    }
    in_rate, out_rate = rates[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
# Example: 1000 queries, each with 500 input tokens and 200 output tokens
print(estimate_cost(500*1000, 200*1000, "haiku"))  # $1.50
Part 11: End-to-End Example: Mini RAG with an LLM API
A pattern you will use again and again. It combines everything covered so far:
import os
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import numpy as np
from dotenv import load_dotenv
load_dotenv()
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
embedder = SentenceTransformer("all-MiniLM-L6-v2")
# === 1. Knowledge Base ===
docs = [
    "Python adalah bahasa pemrograman tingkat tinggi.",
    "PyTorch dipakai untuk deep learning.",
    "Hugging Face menyediakan ribuan model pretrained.",
    "Transformer adalah arsitektur LLM modern.",
]
# Normalized vectors make the dot product below equal to cosine similarity
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)
# === 2. Retrieval ===
def retrieve(query, top_k=2):
    query_emb = embedder.encode(query, normalize_embeddings=True)
    scores = doc_embeddings @ query_emb  # one similarity score per document
    top_idx = np.argsort(scores)[-top_k:][::-1]  # highest scores first
    return [docs[i] for i in top_idx]
# === 3. RAG Generation ===
def rag_answer(question):
    context = "\n".join(retrieve(question))
    prompt = f"""Konteks:
{context}
Pertanyaan: {question}
Jawab berdasarkan konteks saja. Jika tidak ada di konteks, bilang tidak tahu."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
print(rag_answer("Apa fungsi PyTorch?"))
print(rag_answer("Apa itu Hugging Face?"))
This is the RAG foundation that Phase 7 explores in depth. You now have the skeleton; what remains is scaling up with a vector database and a chunking strategy.
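As a first taste of the chunking strategy mentioned above, here is a minimal sketch of fixed-size chunking with overlap. It splits on whitespace for simplicity; real pipelines often split by tokens, sentences, or headings instead. The function name `chunk_words` and its default sizes are assumptions for illustration only.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of chunk_size words; neighbours share overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of embedding some words twice.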
Check Your Understanding
- Can you set up an API key with .env?
- Can you call the OpenAI/Anthropic/Gemini APIs?
- Do you know how to stream output?
- Can you run a multi-turn conversation?
- Do you know the retry pattern and caching?
- Can you do function calling / tool use?
Challenge 6.7
Challenge 1: Build an LLM Wrapper
Build an LLMClient class that supports 3 providers:
- Gemini
- Claude (if you have credit)
- Local (Ollama)
Methods: generate(), chat(), embed().
Challenge 2: Translation Tool
A CLI tool:
python translate.py --to en "Saya suka belajar AI"
python translate.py --from id --to ja "Saya suka kucing"
Use the Gemini API. Cache the results to save cost.
Challenge 3: Code Review Bot
A script that reads a .py file, sends it to an LLM, and gets feedback:
- Issues
- Suggestions
- Refactored version
Write the output to review.md.
Challenge 4: Multi-Turn Chatbot
A CLI chatbot that:
- Is multi-turn (remembers context)
- Saves history to JSON
- Can reload a session
- Tracks tokens
Challenge 5: Batch Processing
Take 100 product reviews and classify their sentiment with an LLM. Use async for speed.
Challenge 6: Function Calling
Build a "personal assistant" that can call:
- get_weather(city)
- add_to_calendar(title, datetime)
- send_email(to, subject, body) (mock)
The LLM calls the appropriate function based on natural language.
Next: challenges.md