## The Challenge
Predicting income brackets from census data is a classic ML classification problem, but applying it to real-world data with 32,000+ records introduces challenges that textbook examples gloss over: class imbalance, mixed feature types, and the need for an interpretable model that stakeholders can trust.
## Data Preprocessing Pipeline
The Adult Income dataset from the UCI Machine Learning Repository contains 14 features — a mix of continuous variables (age, hours-per-week, capital-gain) and categorical ones (education, occupation, marital-status). My preprocessing pipeline handled:
- Missing value imputation using mode for categorical and median for numerical features
- One-hot encoding for nominal categories, ordinal encoding for education levels
- Feature scaling with StandardScaler on numerical columns
- Class weight balancing to address the 75/25 income split
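The four preprocessing steps above compose naturally into a scikit-learn `ColumnTransformer`. The sketch below is illustrative, not the project's actual code: the column groupings are a subset of the dataset's 14 features, the education ordering follows the UCI category names, and the final estimator is a stand-in (class weighting is shown via `class_weight="balanced"` rather than SMOTE).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Illustrative subset of the 14 UCI column names
numeric_cols = ["age", "hours-per-week", "capital-gain"]
nominal_cols = ["occupation", "marital-status"]
# Explicit ordering so OrdinalEncoder encodes education as an ordered scale
education_order = [["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th",
                    "10th", "11th", "12th", "HS-grad", "Some-college",
                    "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters",
                    "Prof-school", "Doctorate"]]

preprocess = ColumnTransformer([
    # Numerical: median imputation, then standardization
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Nominal categoricals: mode imputation, then one-hot encoding
    ("nom", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), nominal_cols),
    # Education: mode imputation, then ordinal encoding with the order above
    ("ord", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ordinal", OrdinalEncoder(categories=education_order))]), ["education"]),
])

# class_weight="balanced" reweights samples inversely to the ~75/25 split
model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(class_weight="balanced", max_iter=1000))])
```

Keeping imputation and encoding inside the pipeline means the same transformations are learned on training folds only, which avoids leakage during cross-validation.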
## Model Comparison
I evaluated four models with stratified 5-fold cross-validation:
| Model | Accuracy | F1 Score | AUC-ROC |
|---|---|---|---|
| Logistic Regression | 79.2% | 0.61 | 0.83 |
| Random Forest | 84.1% | 0.71 | 0.90 |
| Gradient Boosting | 85.3% | 0.73 | 0.91 |
| K-Nearest Neighbors | 81.7% | 0.65 | 0.86 |
Gradient Boosting won on all three metrics. The key insight was that tree-based models naturally handle the mixed feature types and non-linear relationships in census data.
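The comparison loop can be sketched as follows. This is a minimal stand-in, not the project's code: it uses synthetic data in place of the preprocessed census features (with the same 75/25 class skew) and scikit-learn's default hyperparameters, so the printed scores will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed features, skewed 75/25 like the income split
X, y = make_classification(n_samples=500, weights=[0.75, 0.25], random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Stratified folds preserve the class ratio in every train/test split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, est in models.items():
    scores = cross_validate(est, X, y, cv=cv, scoring=["accuracy", "f1", "roc_auc"])
    print(f"{name}: acc={scores['test_accuracy'].mean():.3f} "
          f"f1={scores['test_f1'].mean():.3f} "
          f"auc={scores['test_roc_auc'].mean():.3f}")
```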
## Deployment with FastAPI
The final model is served via a FastAPI backend with a React frontend for the analytics dashboard. The API accepts JSON payloads with the 14 features and returns a probability distribution across income brackets, not just a binary prediction — giving users confidence scores alongside the classification.
## Key Takeaways
- Feature engineering matters more than model choice — creating interaction features between education and occupation improved accuracy by 2.1 percentage points.
- Class imbalance handling is essential — without SMOTE or class weights, the model achieves 76% accuracy but an F1 of only 0.45 on the minority class.
- Interpretability wins trust — SHAP value visualizations in the dashboard help users understand why the model made each prediction.
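The education-occupation interaction from the first takeaway can be sketched as a simple category cross. The helper name and the joined-string representation are illustrative choices; the crossed column is just another nominal feature for the one-hot encoder downstream.

```python
import pandas as pd

def add_education_occupation_cross(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: crossing two categoricals yields a new nominal
    # feature, letting the model learn combinations (e.g. a Bachelors degree
    # paired with an executive role) that neither column captures alone.
    out = df.copy()
    out["education_x_occupation"] = out["education"] + "_" + out["occupation"]
    return out
```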
The complete source code is available on GitHub.