07 — Kaggle Submission Project
Estimated time: 5 hours
Goal: An end-to-end Kaggle project with a submission. The score is not the point; the process and the portfolio are what count.
Why Does This Material Matter?
A Kaggle submission is one of the strongest portfolio pieces for an entry-level data scientist job hunt. Recruiters and hiring managers do look at Kaggle profiles. Your score does not have to be high; what counts is the process (thorough EDA, creative feature engineering, systematic experiments) and your ability to present the results.
For the Dicoding bootcamp, the Kaggle project is discussed in the expert review session and weighs heavily in the final assessment. Beyond that, Kaggle teaches ML discipline that is hard to get from tutorials: dealing with messy real data, iterating many times, reading community discussions, and feeling the pressure of competition. It is some of the best practice you can get for working on production ML projects.
How to Read the Diagram:
- Left to right: the full project flow from download to submission
- Loop back (dashed) from submit to feature engineering = iteration for improvement
- End: push to GitHub + LinkedIn = portfolio
Step-by-Step Walkthrough:
- Download the data with the Kaggle CLI
- EDA: understand the data, find insights
- Feature engineering: build creative features (often the most impactful step)
- Train multiple models (LogReg, RF, XGBoost)
- Compare them via cross-validation
- Tune the best model (GridSearch / Optuna)
- Predict the test set, submit to Kaggle
- Check the score, iterate (loop back to FE)
- Push to GitHub + LinkedIn post
Everyday Analogy: A cooking contest. Shop for ingredients (data), prep them (FE), cook several recipes (models), taste (CV), pick the best, serve it to the judges (submit), get feedback (leaderboard), iterate. Photo + story on Instagram = portfolio.
Static Mermaid diagram as a fallback:
flowchart LR
A["📥 Download<br/>Dataset"] --> B["🔍 EDA"]
B --> C["⚙️ Feature<br/>Engineering"]
C --> D["🤖 Train<br/>Multiple Models"]
D --> E["📊 CV<br/>Compare"]
E --> F["🔧 Tune Best<br/>Model"]
F --> G["🎯 Predict<br/>Test Set"]
G --> H["📤 Submit<br/>to Kaggle"]
H --> I["📈 Iterate<br/>Improve"]
I --> C
H --> J["📝 Push to<br/>GitHub"]
style J fill:#d4f4dd
Choose a Competition
Beginner (Recommended First)
- Titanic (titanic): classic binary classification
- House Prices (house-prices-advanced-regression-techniques): classic regression
- Spaceship Titanic: the successor to Titanic, slightly more complex
All three are always open and meant for learning.
Intermediate
- Tabular Playground Series: monthly
- Featured Competitions with tabular data
Start with Titanic. If you have already submitted there, move up a level.
Full Workflow
Analogy: A Complete Recipe from Raw Ingredients to Restaurant
A Kaggle project = a restaurant's full pipeline: source the ingredients (data), check their quality (EDA), prep them (feature engineering), cook several recipes (multiple models), taste (CV), pick the best recipe (best model), adjust the seasoning (hyperparameters), plate it (submission), take customer feedback (leaderboard), iterate.
flowchart TD
A["1️⃣ Setup Kaggle CLI"] --> B["2️⃣ Download Data"]
B --> C["3️⃣ Notebook 01: EDA"]
C --> D["4️⃣ Notebook 02:<br/>Feature Engineering"]
D --> E["5️⃣ Notebook 03:<br/>Modeling + CV"]
E --> F["6️⃣ Notebook 04:<br/>Tuning + Submission"]
F --> G["7️⃣ Submit to Kaggle"]
G --> H{"Score OK?"}
H -->|No| D
H -->|Yes| I["8️⃣ Push GitHub +<br/>LinkedIn Post"]
style I fill:#d4f4dd
Step 1: Setup Kaggle
pip install kaggle
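Get an API key from kaggle.com → account → "Create New API Token". Save it to ~/.kaggle/kaggle.json and restrict its permissions with chmod 600 ~/.kaggle/kaggle.json. Join the competition (accept its rules) on the website first, otherwise the download request is rejected.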
kaggle competitions download -c titanic
unzip titanic.zip -d data/
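If the download fails, the credentials file is a common culprit. The sketch below is one way to sanity-check it before calling the CLI. The helper name check_kaggle_credentials is hypothetical (it is not part of the kaggle package), and it assumes the token file is the JSON file with "username" and "key" fields that the "Create New API Token" button produces.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_credentials(path=None):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration, not part of the kaggle package.
    """
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} not found"]
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # Assumption: the token file carries these two fields.
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing field: {field}")
    # On POSIX systems the token should be readable by its owner only.
    if os.name == "posix" and stat.S_IMODE(path.stat().st_mode) & 0o077:
        problems.append(f"file is accessible to other users; run: chmod 600 {path}")
    return problems


if __name__ == "__main__":
    for problem in check_kaggle_credentials():
        print("WARNING:", problem)
```

Running it with no arguments checks the default location; pass a path to check a file stored elsewhere.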
Step 2: Notebook Structure
projects/kaggle-titanic/
├── README.md
├── notebooks/
│   ├── 01-eda.ipynb
│   ├── 02-feature-engineering.ipynb
│   ├── 03-modeling.ipynb
│   └── 04-final-submission.ipynb
├── data/
│   ├── train.csv
│   ├── test.csv
│   └── submission.csv
└── models/
    └── best_model.pkl
Step 3: Notebook 1 — EDA
Already covered in Phase 4. Summary:
- Inspect shape, dtypes, missing values
- Distribution of each feature
- Target distribution
- Correlation
- Hypotheses
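As a reminder of what that checklist looks like in code, here is a minimal sketch using pandas. The helper quick_overview and the four-row sample frame are made up for illustration; in the notebook you would pass the real train.csv frame instead.

```python
import pandas as pd


def quick_overview(df, target="Survived"):
    """Collect the basic EDA facts from the checklist into one dictionary."""
    numeric = df.select_dtypes("number")
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        # For a 0/1 target the mean is the share of positive rows
        "target_rate": df[target].mean(),
        # Linear correlation of each numeric column with the target
        "corr_with_target": numeric.corr()[target].drop(target).to_dict(),
    }


# Tiny made-up sample standing in for train.csv
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 1, 3],
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

overview = quick_overview(sample)
print(overview["shape"])
print(overview["missing"])
print(overview["corr_with_target"])
```

This covers only the numeric side; distributions and hypotheses still need plots and your own reading of the data.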
Step 4: Notebook 2 — Feature Engineering
import pandas as pd
# Assumed paths: adjust to where the notebook runs (for example "../data/train.csv")
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
def feature_engineering(df):
    df = df.copy()
    # Fill missing values
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    # Extract the title (Mr, Mrs, ...) from the name; rare titles become "Other"
    df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    title_map = {"Mr": "Mr", "Miss": "Miss", "Mrs": "Mrs", "Master": "Master"}
    df["Title"] = df["Title"].map(title_map).fillna("Other")
    # Family size includes the passenger
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
    # Flag whether a cabin was recorded at all
    df["HasCabin"] = df["Cabin"].notna().astype(int)
    # Bin age into coarse groups
    df["AgeBin"] = pd.cut(df["Age"], bins=[0, 12, 18, 35, 60, 100],
                          labels=["Child", "Teen", "Young", "Adult", "Senior"])
    # Drop columns that are no longer needed
    df = df.drop(columns=["Name", "Ticket", "Cabin", "PassengerId"])
    return df
train_fe = feature_engineering(train)
test_fe = feature_engineering(test)
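Note that feature_engineering computes the median and mode on whichever frame it receives, so the test set gets filled with its own statistics. One alternative, sketched below under the assumption that you want train-only statistics, is to learn the fill values once and reuse them. The helper names fit_fill_values and apply_fill_values and the demo rows are hypothetical.

```python
import pandas as pd


def fit_fill_values(train_df):
    """Learn fill values from the training set only."""
    return {
        "Age": train_df["Age"].median(),
        "Fare": train_df["Fare"].median(),
        "Embarked": train_df["Embarked"].mode()[0],
    }


def apply_fill_values(df, fill_values):
    """Fill missing values per column with statistics learned elsewhere."""
    return df.fillna(value=fill_values)


# Made-up rows standing in for train.csv and test.csv
train_demo = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, None],
    "Fare": [7.0, 8.0, 100.0, 9.0],
    "Embarked": ["S", "S", "C", None],
})
test_demo = pd.DataFrame({
    "Age": [None, 60.0],
    "Fare": [None, 50.0],
    "Embarked": ["Q", None],
})

fill_values = fit_fill_values(train_demo)
train_filled = apply_fill_values(train_demo, fill_values)
test_filled = apply_fill_values(test_demo, fill_values)
print(fill_values)
print(test_filled)
```

On a dataset as small as Titanic the difference in score is usually minor, but the habit carries over to projects where it matters.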
Step 5: Notebook 3 — Modeling
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import xgboost as xgb
# Split the engineered training frame into features and target
X_train = train_fe.drop(columns=["Survived"])
y_train = train_fe["Survived"]
# IsAlone and HasCabin are listed here so the ColumnTransformer does not silently drop them
numeric = ["Age", "Fare", "FamilySize", "Pclass", "IsAlone", "HasCabin"]
categorical = ["Sex", "Embarked", "Title", "AgeBin"]
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
# Try multiple models with the same preprocessing and the same 5-fold CV
models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=42),
    "XGB": xgb.XGBClassifier(random_state=42, eval_metric="logloss"),
}
for name, model in models.items():
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("model", model),
    ])
    scores = cross_val_score(pipeline, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")
Step 6: Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
best_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", xgb.XGBClassifier(random_state=42)),
])
# The "model__" prefix routes each parameter to the pipeline step named "model"
param_grid = {
    "model__n_estimators": [100, 200, 500],
    "model__max_depth": [3, 5, 7],
    "model__learning_rate": [0.05, 0.1, 0.2],
}
grid = GridSearchCV(best_pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
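Before launching a grid search it helps to know how many model fits it implies. A small sketch of that arithmetic for the grid above, using scikit-learn's ParameterGrid (the variable names are illustrative):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "model__n_estimators": [100, 200, 500],
    "model__max_depth": [3, 5, 7],
    "model__learning_rate": [0.05, 0.1, 0.2],
}
cv_folds = 5

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 combinations
n_fits = n_candidates * cv_folds + 1           # +1 for the final refit on all training data
print(f"{n_candidates} candidates -> {n_fits} model fits")
```

Every extra value in one list multiplies the total, which is the main reason to keep grids small or to switch to a sampling-based search once the grid grows.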
Step 7: Submission
# GridSearchCV refits the best pipeline on the full training data (refit=True by default),
# so best_estimator_ is ready to predict without another fit call
final_model = grid.best_estimator_
# Predict test set
predictions = final_model.predict(test_fe)
# Submission file
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})
submission.to_csv("submission.csv", index=False)
Then submit from the shell:
kaggle competitions submit -c titanic -f submission.csv -m "First submission"
Step 8: Iterate
flowchart LR
A["📊 Check<br/>Leaderboard"] --> B["🔍 Identify<br/>weak spots"]
B --> C{"What can<br/>be improved?"}
C -->|Missing features| D["⚙️ More feature<br/>engineering"]
C -->|Model not accurate enough| E["🤖 Try another model<br/>or an ensemble"]
C -->|Hyperparameters| F["🔧 More aggressive<br/>tuning (Optuna)"]
D --> G["🔁 Resubmit"]
E --> G
F --> G
G --> A
- Check the leaderboard
- Identify what can be improved
- Iterate (try new features, other models, ensembles)
- Resubmit
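Kaggle Tips for Beginners
1. Read the Discussions
The Kaggle forums are gold. Many people share their techniques there. Study them.
2. Study Public Notebooks
Winners often share their work. Read it, but do not copy blindly. Understand it.
3. Feature Engineering > Model Tuning
For beginners, a good feature is often worth far more than model tuning; treat "10x" as a rule of thumb, not a measurement.
4. Ensembles Often Win
Combine the predictions of several models:
avg_proba = (rf.predict_proba(X)[:, 1] + xgb_model.predict_proba(X)[:, 1]) / 2  # xgb_model: a fitted XGBClassifier (not the xgb module)
final_pred = (avg_proba >= 0.5).astype(int)  # Kaggle expects 0/1 labels here, not probabilities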
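A self-contained version of that averaging idea is sketched below. It runs on synthetic data and swaps in LogisticRegression next to the random forest so it needs only scikit-learn; the data, split sizes, and 0.5 threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not Titanic)
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=42),
]
for model in models:
    model.fit(X_tr, y_tr)

# Average the predicted probability of class 1 across models
avg_proba = np.mean([m.predict_proba(X_val)[:, 1] for m in models], axis=0)
# Turn the averaged probability into a hard 0/1 label
final_pred = (avg_proba >= 0.5).astype(int)

for model in models:
    print(type(model).__name__, "accuracy:", model.score(X_val, y_val))
print("ensemble accuracy:", (final_pred == y_val).mean())
```

An ensemble is not guaranteed to beat its best member, so compare it against each single model with the same validation data before submitting it.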
5. Consistent Cross-Validation Setup
Compare models fairly: use the same CV splits for all of them.
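One way to guarantee identical splits is to create a single splitter with a fixed random_state and pass it to every evaluation. A minimal sketch on synthetic data (the data and the two candidate models are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not Titanic)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# One splitter, reused for every model, so each model sees the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "LogReg": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}
for name, scores in results.items():
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")
```

Because the folds match, the per-fold scores of two models can also be compared fold by fold, not just by their means.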
6. Watch Out for Data Leakage
Use a pipeline. Fit the preprocessor on the training fold only, in every CV iteration.
Common Mistakes & FAQ
Common Mistakes
flowchart TD
A["⚠️ Common Kaggle Mistakes"] --> B["💧 Data leakage<br/>fit on full data"]
A --> C["📊 Good CV<br/>bad LB"]
A --> D["🎯 Overfit<br/>public LB"]
A --> E["🤖 Straight to XGBoost<br/>without a baseline"]
A --> F["⚙️ Only tuning the model<br/>no feature eng"]
A --> G["📤 Forgetting to check<br/>submission format"]
style A fill:#ffe0e0
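1. Data Leakage in the Pipeline
❌ Fitting the scaler/encoder on the full data (train + test) → information from the test set leaks.
✅ Use an sklearn Pipeline: inside cross-validation it is refit on each training fold automatically. Or fit the preprocessor on the training fold yourself.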
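To see what "fit on the training fold only" means in practice, the sketch below fits a small pipeline on one fold by hand and inspects the statistics the scaler learned. The synthetic data and step names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not Titanic)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 5.0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, valid_idx = next(iter(cv.split(X, y)))

# Fitting the pipeline fits the scaler on the training fold only
pipe.fit(X[train_idx], y[train_idx])
fold_mean = pipe.named_steps["scaler"].mean_
full_mean = X.mean(axis=0)

print("scaler mean (training fold):", fold_mean)
print("mean over all rows:         ", full_mean)
print("validation accuracy:", pipe.score(X[valid_idx], y[valid_idx]))
```

cross_val_score and GridSearchCV repeat exactly this fit for every fold when you hand them the whole pipeline, which is why the scaler never sees validation rows.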
2. Good CV, Bad LB
CV F1 = 0.85 but the public leaderboard shows only 0.70 → distribution shift or overfitting to CV.
✅ Use stratified CV, use more folds, and do not optimize against CV too aggressively.
3. Overfitting the Public Leaderboard
The public LB is scored on a subset of the test set. If you submit 100 times and tweak based on the public LB, you fit to the public subset and then fail on the private set (the final scoring).
✅ Trust your CV. Submit just enough to validate, not after every micro-tweak.
4. Straight to XGBoost
Many people start with XGBoost. Without a baseline (LogReg / RF), you cannot tell whether the added complexity is worth it.
✅ Start from DummyClassifier → LogReg → RF → XGBoost. Compare the improvement at each step.
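The baseline ladder can be sketched as follows. The example stops at the random forest so it needs only scikit-learn, and the synthetic data is an assumption; on Titanic you would pass the real pipeline and training frame.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with a simple linear decision boundary
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Ordered from the simplest baseline to the more complex model
ladder = {
    "Dummy": DummyClassifier(strategy="most_frequent"),
    "LogReg": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in ladder.items()
}

previous = None
for name, score in scores.items():
    gain = "" if previous is None else f" ({score - previous:+.3f} vs previous step)"
    print(f"{name}: {score:.3f}{gain}")
    previous = score
```

The dummy score is the floor: a model that barely beats it has learned little, however sophisticated it is.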
5. Prioritizing Model Tuning over Feature Engineering
Tuning XGBoost from 0.78 to 0.79 can take days. A new feature can give +0.05 in 30 minutes.
✅ Spend roughly 80% of your time on feature engineering and 20% on tuning.
6. Wrong Submission Format
❌ Submitting a file with a typo in a column name, a missing ID column, or invalid values (for example negative values in a binary classification task). The file is rejected automatically.
✅ Always check sample_submission.csv (for Titanic the sample file is named gender_submission.csv) and make sure your format is identical.
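That check can be automated before every upload. The sketch below compares a submission frame against the sample file; validate_submission is a hypothetical helper, and it assumes a two-column layout (ID, then a 0/1 target) as in Titanic. The IDs in the demo are made up.

```python
import pandas as pd


def validate_submission(submission, sample):
    """Return a list of format problems; an empty list means the frames match.

    Hypothetical helper: assumes the first column is the ID and the second
    is a binary 0/1 target.
    """
    if list(submission.columns) != list(sample.columns):
        return [f"columns {list(submission.columns)} != expected {list(sample.columns)}"]

    problems = []
    id_col, target_col = sample.columns[0], sample.columns[1]
    if len(submission) != len(sample):
        problems.append(f"{len(submission)} rows, expected {len(sample)}")
    if set(submission[id_col]) != set(sample[id_col]):
        problems.append("IDs differ from the sample file")
    if submission[target_col].isna().any():
        problems.append("missing predictions")
    elif not submission[target_col].isin([0, 1]).all():
        problems.append("predictions must be 0 or 1")
    return problems


# Made-up frames standing in for the sample file and your own submission
sample = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 0, 0]})
good = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [1, 0, 1]})
typo = pd.DataFrame({"PassengerID": [1, 2, 3], "Survived": [1, 0, 1]})
negative = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [1, -1, 1]})

print(validate_submission(good, sample))
print(validate_submission(typo, sample))
print(validate_submission(negative, sample))
```

In the notebook you would load the sample file with pd.read_csv and call the helper right before to_csv.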
FAQ
Q: What score is good for Titanic?
A: Roughly, 0.78-0.80 is solid (around top 30%) and 0.80+ is good (around top 10%). A score of 0.83+ is likely overfit or a sign of data leakage.
Q: How long does a first submission take?
A: 4-8 hours is realistic for a beginner. Do not rush. Quality > speed.
Q: Public vs. private leaderboard?
A: Public = the score on a subset of the test set (visible). Private = the hidden score used for the final ranking. Do not overfit to the public one.
Q: Do I have to use deep learning?
A: For tabular Kaggle competitions, no. XGBoost / LightGBM almost always win. DL is for image/text/audio.
Q: May I copy a public notebook?
A: Yes for learning, no for your submission. Understand it, modify it, do not plagiarize.
Q: How do I get a Kaggle medal?
A: It takes time and practice. Bronze (top 10% in competitions with 1,000+ teams) is achievable for a serious beginner. Silver/Gold require experience. Getting Started competitions such as Titanic do not award medals.
Q: Are ensembles important?
A: For pushing into the top 5%, yes. For learning and a first portfolio piece, one good single model is enough.
Check Your Understanding
- Can you set up the Kaggle CLI and download a dataset?
- Can you structure a project's notebooks cleanly?
- Can you run the EDA → FE → Modeling → Submission workflow?
- Do you know how to avoid overfitting the public LB?
- Can you submit and iterate based on the score?
From Submission to Portfolio
README
# Kaggle: Titanic Survival Prediction
End-to-end ML project on the Titanic dataset.
## Approach
1. EDA — identified 5 key insights
2. Feature engineering — extracted Title, FamilySize, IsAlone, ...
3. Modeling — compared LogReg, RF, XGBoost
4. Tuning — GridSearch on XGBoost
5. Submission — score 0.78
## Best Score
Public leaderboard: **0.78** (top 30%)
## Key Findings
- Title (Mr, Mrs, etc.) is a strong predictor
- FamilySize > 4 reduces survival probability
- ...
## Stack
- Python, pandas, sklearn, xgboost
- Notebook-based workflow
## Reproduce
\`\`\`bash
kaggle competitions download -c titanic
jupyter notebook notebooks/04-final-submission.ipynb
\`\`\`
Update Profile README
Add a link to the project and to your Kaggle profile.
LinkedIn Post
Share your journey:
- What you learned
- Score (if it is good)
- Insights
- Links to GitHub + Kaggle
Recruiters often scout on LinkedIn. Posting consistently raises your visibility.
Challenge 5.7
Challenge — Full Kaggle Project
- Choose a competition (Titanic for your first)
- Work through the full workflow above
- Submit at least 3 times (iterate)
- Push the notebooks to GitHub
- Post on LinkedIn
Goal: this becomes a strong portfolio piece for your job hunt and the basis for discussion in the bootcamp's expert review session.
Next: challenges.md