Machine Learning · Python · FastAPI

    Building an Income Prediction System with Machine Learning

    How I engineered a machine learning pipeline that achieves 85% accuracy on 32K+ census records — from data preprocessing to FastAPI deployment.

    Abdul Rahman Azam
    March 25, 2026 · 8 min read

    The Challenge

    Predicting income brackets from census data is a classic ML classification problem, but applying it to real-world data with 32,000+ records introduces challenges that textbook examples gloss over: class imbalance, mixed feature types, and the need for an interpretable model that stakeholders can trust.

    Data Preprocessing Pipeline

    The Adult Income dataset from the UCI Machine Learning Repository contains 14 features — a mix of continuous variables (age, hours-per-week, capital-gain) and categorical ones (education, occupation, marital-status). My preprocessing pipeline handled:

    • Missing value imputation using mode for categorical and median for numerical features
    • One-hot encoding for nominal categories, ordinal encoding for education levels
    • Feature scaling with StandardScaler on numerical columns
    • Class weight balancing to address the 75/25 income split
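The steps above can be sketched as a single scikit-learn `ColumnTransformer`. This is a minimal illustration, not the article's exact pipeline: the column names follow the UCI Adult schema, and the education ordering is an assumption.

```python
# Sketch of the preprocessing pipeline: median/mode imputation,
# one-hot encoding for nominal columns, ordinal encoding for
# education, and standard scaling for numeric columns.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

numeric = ["age", "hours-per-week", "capital-gain"]
nominal = ["workclass", "occupation", "marital-status"]
# Assumed low-to-high ordering of UCI education levels (abridged).
education_order = [["HS-grad", "Some-college", "Bachelors", "Masters", "Doctorate"]]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("nom", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), nominal),
    ("edu", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ordinal", OrdinalEncoder(categories=education_order))]),
     ["education"]),
])

# Tiny demo frame to show the transformer end to end.
sample = pd.DataFrame({
    "age": [39, 50], "hours-per-week": [40, 13], "capital-gain": [2174, 0],
    "workclass": ["State-gov", "Self-emp-not-inc"],
    "occupation": ["Adm-clerical", "Exec-managerial"],
    "marital-status": ["Never-married", "Married-civ-spouse"],
    "education": ["Bachelors", "Masters"],
})
features = preprocess.fit_transform(sample)
```

Class weights are handled at the model level (e.g. `class_weight="balanced"`) rather than in the transformer itself.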

    Model Comparison

    I evaluated four model families using stratified 5-fold cross-validation:

    Model               | Accuracy | F1 Score | AUC-ROC
    Logistic Regression | 79.2%    | 0.61     | 0.83
    Random Forest       | 84.1%    | 0.71     | 0.90
    Gradient Boosting   | 85.3%    | 0.73     | 0.91
    K-Nearest Neighbors | 81.7%    | 0.65     | 0.86

    Gradient Boosting won on all three metrics. The key insight was that tree-based models naturally handle the mixed feature types and non-linear relationships in census data.
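The comparison setup can be sketched as follows. The synthetic data and resulting scores are purely illustrative stand-ins (the article's numbers come from the real census data), but the stratified 5-fold structure matches what is described above.

```python
# Stratified 5-fold comparison of the four model families,
# on a small synthetic dataset with a 75/25 class split.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, weights=[0.75, 0.25], random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
          for name, model in models.items()}
```

Stratification matters here: with a 75/25 split, unstratified folds can end up with too few minority-class examples to score reliably.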

    Deployment with FastAPI

    The final model is served via a FastAPI backend with a React frontend for the analytics dashboard. The API accepts JSON payloads with the 14 features and returns a probability distribution across income brackets, not just a binary prediction — giving users confidence scores alongside the classification.

    Key Takeaways

    1. Feature engineering matters more than model choice — creating interaction features between education and occupation improved accuracy by 2.1 percentage points.
    2. Class imbalance handling is essential — without SMOTE or class weights, the model achieves 76% accuracy but an F1 of only 0.45 on the minority class.
    3. Interpretability wins trust — SHAP value visualizations in the dashboard help users understand why the model made each prediction.
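The interaction feature from takeaway 1 can be as simple as crossing the two categorical columns so the encoder sees each education-occupation pair as its own category. The helper name below is a hypothetical illustration, not the article's code.

```python
# Cross education and occupation into a single categorical feature
# that one-hot encoding can then expand.
import pandas as pd

def add_interaction(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["education_x_occupation"] = out["education"] + "_" + out["occupation"]
    return out

sample = pd.DataFrame({
    "education": ["Bachelors", "HS-grad"],
    "occupation": ["Exec-managerial", "Craft-repair"],
})
crossed = add_interaction(sample)
```

The crossed column captures combinations (e.g. a degree paired with a managerial role) that neither feature signals on its own, which is where the reported accuracy gain comes from.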

    The complete source code is available on GitHub.


    Abdul Rahman Azam

    Full Stack AI Engineer — building AI-powered products from model to deployment. Open to AI/ML opportunities.