## The Challenge
Predicting income brackets from census data is a classic ML classification problem, but applying it to real-world data with 32,000+ records introduces challenges that textbook examples gloss over: class imbalance, mixed feature types, and the need for an interpretable model that stakeholders can trust.
## Data Preprocessing Pipeline
The Adult Income dataset from the UCI Machine Learning Repository contains 14 features — a mix of continuous variables (age, hours-per-week, capital-gain) and categorical ones (education, occupation, marital-status). My preprocessing pipeline handled:
- Missing value imputation using mode for categorical and median for numerical features
- One-hot encoding for nominal categories, ordinal encoding for education levels
- Feature scaling with StandardScaler on numerical columns
- Class weight balancing to address the 75/25 income split
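The four preprocessing steps above compose naturally into a scikit-learn `ColumnTransformer`. The sketch below is illustrative, not the project's actual code: the column groupings are a subset of the dataset's 14 features, the education ordering follows the UCI category names, and the final estimator is a stand-in (class weighting is shown via `class_weight="balanced"` rather than SMOTE).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Illustrative subset of the 14 UCI column names
numeric_cols = ["age", "hours-per-week", "capital-gain"]
nominal_cols = ["occupation", "marital-status"]
# Explicit ordering so OrdinalEncoder encodes education as an ordered scale
education_order = [["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th",
                    "10th", "11th", "12th", "HS-grad", "Some-college",
                    "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters",
                    "Prof-school", "Doctorate"]]

preprocess = ColumnTransformer([
    # Numerical: median imputation, then standardization
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Nominal categoricals: mode imputation, then one-hot encoding
    ("nom", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), nominal_cols),
    # Education: mode imputation, then ordinal encoding with the order above
    ("ord", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ordinal", OrdinalEncoder(categories=education_order))]), ["education"]),
])

# class_weight="balanced" reweights samples inversely to the ~75/25 split
model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(class_weight="balanced", max_iter=1000))])
```

Keeping imputation and encoding inside the pipeline means the same transformations are learned on training folds only, which avoids leakage during cross-validation.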
## Model Comparison
I evaluated four models with stratified 5-fold cross-validation:
| Model | Accuracy | F1 Score | AUC-ROC |
|---|---|---|---|
| Logistic Regression | 79.2% | 0.61 | 0.83 |
| Random Forest | 84.1% | 0.71 | 0.90 |
| Gradient Boosting | 85.3% | 0.73 | 0.91 |
| K-Nearest Neighbors | 81.7% | 0.65 | 0.86 |
Gradient Boosting won on all three metrics. The key insight was that tree-based models naturally handle the mixed feature types and non-linear relationships in census data.
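The comparison loop can be sketched as follows. This is a minimal stand-in, not the project's code: it uses synthetic data in place of the preprocessed census features (with the same 75/25 class skew) and scikit-learn's default hyperparameters, so the printed scores will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed features, skewed 75/25 like the income split
X, y = make_classification(n_samples=500, weights=[0.75, 0.25], random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Stratified folds preserve the class ratio in every train/test split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, est in models.items():
    scores = cross_validate(est, X, y, cv=cv, scoring=["accuracy", "f1", "roc_auc"])
    print(f"{name}: acc={scores['test_accuracy'].mean():.3f} "
          f"f1={scores['test_f1'].mean():.3f} "
          f"auc={scores['test_roc_auc'].mean():.3f}")
```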
## Deployment with FastAPI
The final model is served via a FastAPI backend with a React frontend for the analytics dashboard. The API accepts JSON payloads with the 14 features and returns a probability distribution across income brackets, not just a binary prediction — giving users confidence scores alongside the classification.
## Key Takeaways
- Feature engineering matters more than model choice — creating interaction features between education and occupation improved accuracy by 2.1 percentage points.
- Class imbalance handling is essential — without SMOTE or class weights, the model achieves 76% accuracy but an F1 of only 0.45 on the minority class.
- Interpretability wins trust — SHAP value visualizations in the dashboard help users understand why the model made each prediction.
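The education-occupation interaction from the first takeaway can be sketched as a simple category cross. The helper name and the joined-string representation are illustrative choices; the crossed column is just another nominal feature for the one-hot encoder downstream.

```python
import pandas as pd

def add_education_occupation_cross(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: crossing two categoricals yields a new nominal
    # feature, letting the model learn combinations (e.g. a Bachelors degree
    # paired with an executive role) that neither column captures alone.
    out = df.copy()
    out["education_x_occupation"] = out["education"] + "_" + out["occupation"]
    return out
```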
The complete source code is available on GitHub.