Wine Quality Prediction

Modern Machine Learning Analysis of Portuguese Vinho Verde Wine

Using Advanced Gradient Boosting & Explainable AI

Overview

Features: 15

About This Project

This project analyzes the quality of Portuguese "Vinho Verde" wines using modern machine learning techniques. The dataset contains 6,497 wine samples, each described by 11 physicochemical properties.

What's New?
  • Advanced Models: XGBoost, LightGBM, CatBoost
  • Explainable AI: SHAP values for interpretability
  • Multiple Approaches: Classification & Regression
  • Clustering: Discovered natural wine groupings
  • Interactive Viz: Plotly-based visualizations
Wine Features
  • Fixed & Volatile Acidity
  • Citric Acid & Residual Sugar
  • Chlorides & Sulfur Dioxide
  • Density, pH, Sulphates
  • Alcohol Content & Wine Type
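
As a rough illustration of how wine type can sit alongside the physicochemical features, the sketch below stacks a red and a white table and records the type in a 0/1 column. The function name `combine_wines`, the column name `is_red`, and the tiny hand-made frames are assumptions for illustration; the project's actual loading code is not shown on this page.

```python
import pandas as pd

def combine_wines(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Stack red and white samples and record wine type as a 0/1 column."""
    red = red.assign(is_red=1)      # hypothetical column name
    white = white.assign(is_red=0)
    return pd.concat([red, white], ignore_index=True)

# Tiny hand-made frames standing in for the real data (values are illustrative only).
red = pd.DataFrame({"alcohol": [9.4, 9.8], "pH": [3.51, 3.20], "quality": [5, 5]})
white = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "pH": [3.00, 3.30, 3.26], "quality": [6, 6, 6]})

wines = combine_wines(red, white)
print(wines.shape)            # (5, 4)
print(wines["is_red"].sum())  # 2
```

Encoding the type as a single numeric column keeps the combined table usable by tree-based models without further preprocessing.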

Data Exploration

PCA Visualization - Colored by Quality
UMAP Visualization - Colored by Cluster
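
A 2-D PCA projection like the one named above could be produced along these lines. This is a minimal sketch on synthetic data, assuming scikit-learn and standardized inputs; the 200x12 random matrix is a stand-in, not the wine data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))            # synthetic stand-in: 200 samples, 12 features
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]   # add some correlation so PCA has structure to find

# Standardize first so that features on large scales do not dominate the components.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)      # one (x, y) point per sample, ready to plot

print(coords.shape)                       # (200, 2)
print(pca.explained_variance_ratio_)      # share of variance captured by each component
```

The two columns of `coords` are what a scatter plot would show, with point color taken from the quality score or the cluster label.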

Machine Learning Models

Model Performance Comparison

We trained and compared four models: Random Forest, XGBoost, LightGBM, and CatBoost.
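
A comparison of this kind can be sketched with cross-validated accuracy. To stay self-contained, the example below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost, LightGBM, and CatBoost models, and synthetic data from `make_classification` instead of the wine samples; both are assumptions, so the scores say nothing about the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem with 12 features (stand-in for the wine data).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=3, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Stand-in for the gradient boosting libraries named in the text.
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 3 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in models.items()
}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best:", best)
```

Scoring every model on the same folds keeps the comparison fair; swapping in another estimator only requires adding an entry to `models`.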

Detailed Visualizations

Binary Classification
Regression Predictions

Key Insights

Feature Importance (SHAP Analysis)

SHAP (SHapley Additive exPlanations) values show which features contribute most to predictions:

Note: SHAP visualizations will be generated when you run the wine-quality-ml-updated.ipynb notebook. The analysis will show:
  • Feature Importance: Which features have the most impact on predictions
  • SHAP Summary Plot: How feature values (high/low) affect quality predictions
  • Individual Predictions: Waterfall plots explaining specific wine predictions
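
To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values for a single sample by enumerating every feature coalition. It is not the `shap` package and not the notebook's code: the toy `predict` function, the three-feature setup, and the random background data are all assumptions, and the enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

import numpy as np

def shapley_values(predict, x, background):
    """Exact Shapley values for one sample x (cost grows as 2**n_features).

    Features outside a coalition are taken from the background rows and the
    model output is averaged over those rows.
    """
    n = len(x)

    def value(coalition):
        data = background.copy()
        idx = np.array(coalition, dtype=int)
        data[:, idx] = x[idx]          # fix coalition features to the sample's values
        return predict(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

# Hypothetical toy model over three standardized features.
def predict(X):
    return 0.5 * X[:, 0] - 2.0 * X[:, 1] + X[:, 0] * X[:, 2]

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, -0.5, 0.3])

phi = shapley_values(predict, x, background)
baseline = predict(background).mean()
print(phi)
# Additivity: the contributions sum to (prediction - average prediction).
print(np.isclose(phi.sum(), predict(x[None, :])[0] - baseline))
```

The final check illustrates the "additive" part of the name: a waterfall plot for one wine is this decomposition drawn as a bar per feature.
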
Top 5 Most Important Features:
  1. Alcohol - Higher alcohol content strongly predicts higher quality (correlation with quality: 0.48)
  2. Volatile Acidity - High volatile acidity (a vinegar-like taste) predicts lower quality (correlation: -0.39)
  3. Sulphates - Additives that act as preservatives; higher levels are associated with higher quality (correlation: 0.25)
  4. Chlorides - Higher chloride content (saltiness) is associated with lower quality (correlation: -0.21)
  5. Total Sulfur Dioxide - Higher preservative levels are associated with lower quality (correlation: -0.19)
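
Correlations like those quoted in the list can be computed directly with pandas. The sketch below assumes Pearson correlation (the default of `DataFrame.corr`; the page does not say which measure was used) and runs on synthetic data, so its numbers will not match the figures above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
alcohol = rng.normal(10.5, 1.2, n)
volatile_acidity = rng.normal(0.34, 0.16, n)
# Synthetic quality score: rises with alcohol, falls with volatile acidity, plus noise.
quality = 0.4 * alcohol - 2.0 * volatile_acidity + rng.normal(0, 0.7, n)

df = pd.DataFrame(
    {"alcohol": alcohol, "volatile_acidity": volatile_acidity, "quality": quality}
)

# Correlation of every feature with quality, ordered by absolute strength.
corr = df.corr()["quality"].drop("quality")
corr = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr)
```

Sorting by absolute value puts strong negative predictors next to strong positive ones, which matches how the ranking above mixes both signs.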

What We Learned

Model Performance
  • Gradient boosting models (XGBoost, LightGBM, CatBoost) significantly outperform the traditional Random Forest model
  • Best accuracy: ~72% for multi-class classification (predicting exact quality scores 4-8)
  • Binary classification (Good vs Not Good) achieves higher accuracy (~80%)
  • The regression approach provides continuous quality estimates with low RMSE
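
The binary framing and the two metrics mentioned above can be written out in a few lines. This is a sketch under stated assumptions: the cutoff of 7 for a "Good" wine is hypothetical (the page does not give the threshold), and the sample scores and predictions are made up.

```python
import numpy as np

def to_binary_label(quality, threshold=7):
    """Return 1 for 'Good' wines (quality >= threshold), else 0.

    The default threshold is an assumption, not taken from the project.
    """
    return (np.asarray(quality) >= threshold).astype(int)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

quality = [5, 6, 7, 8, 4]              # made-up true scores
preds = [5.5, 6.0, 6.5, 7.5, 5.0]      # made-up regression outputs

labels = to_binary_label(quality)
print(labels)                                     # [0 0 1 1 0]
print(rmse(quality, preds))                       # error in quality-score units
print(accuracy(labels, to_binary_label(preds)))   # thresholded regression vs. true labels
```

RMSE is expressed in the same units as the quality score, which makes it easier to read than squared error when judging how far a continuous estimate is from the rating.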
Wine Quality Factors
  • Alcohol content is the strongest positive predictor of quality
  • Volatile acidity negatively impacts perceived quality
  • Chlorides (saltiness) correlate with lower quality
  • Wine type (red vs white) influences quality patterns
  • Clustering revealed 4 natural groupings with distinct characteristics
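
One way to arrive at a fixed number of groups is k-means on standardized features. The page does not state which clustering algorithm was used, so k-means is an assumption here, and `make_blobs` supplies synthetic data in place of the wine samples.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data with four well-separated groups (stand-in for the wine features).
X, _ = make_blobs(n_samples=400, centers=4, n_features=12, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)    # cluster index (0-3) for every sample

sizes = np.bincount(labels, minlength=4)
print(sizes)                             # number of samples per cluster
print(kmeans.cluster_centers_.shape)     # (4, 12): one centroid per cluster
```

Comparing the rows of `cluster_centers_` feature by feature is a simple way to describe what distinguishes each group.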

Clustering Analysis