How We Helped a University of Chicago Student Improve His Machine Learning Regression Project

Project Overview

Mickey, a Master’s student in Data Science at the University of Chicago, approached us with just one week left before submitting his Machine Learning project.

The task was to build a regression model that could predict house prices using structured data. The dataset included features like square footage, number of bedrooms, location, and year built.

He knew basic Python, but this was his first real-world ML project in which everything (data cleaning, model selection, evaluation, and reporting) had to come together correctly.

He didn’t want someone to just “do it for him.” He wanted to understand what he was doing.

That’s where we stepped in.

The Actual Problem He Was Facing

When we reviewed his code and dataset, we identified four major issues.

1. Data Preprocessing Confusion

  • Missing values were left untreated.

  • Some categorical columns were not encoded properly.

  • Feature scaling was inconsistent.

He wasn’t sure which preprocessing steps were necessary and why.

2. Model Selection Uncertainty

He had tried Linear Regression, but the results were weak.

R² Score: 0.59
Mean Absolute Error: high, with a wide spread between predicted and actual prices

He wasn’t sure whether to:

  • Improve Linear Regression

  • Switch to Random Forest

  • Try multiple algorithms

3. Improper Train-Test Workflow

He scaled the entire dataset before splitting.

This created data leakage, which inflated performance artificially. This is a common beginner mistake in ML projects.

During one session, his model suddenly showed an R² score of 0.99, which looked impressive at first. But we quickly realized that scaling was done before splitting the dataset. That small mistake caused data leakage and made the model appear more accurate than it actually was.

Once we corrected the workflow and retrained the model properly, the performance stabilized at a realistic R² of 0.83.

 

4. Weak Model Interpretation

He calculated metrics but didn’t fully understand:

  • What MAE actually meant

  • How to interpret R²

  • Whether his model was overfitting

Solution: Our Step-by-Step Approach

Instead of rewriting everything, we worked collaboratively with him. We paired him with one of our ML mentors who had previously worked on regression-based academic projects.

We scheduled structured sessions and rebuilt the workflow properly.

Step 1 – Clean and Prepare the Dataset Correctly

We first focused only on the data.

  • Filled numerical missing values using mean imputation.

  • Used mode for categorical features.

  • Applied encoding where necessary.

  • Performed feature scaling after train-test split.

Example:

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Split first, then scale: the scaler must only ever see training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same scaling for test data

We explained why scaling must only be fit on training data. That clarity changed his understanding completely.
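The imputation and encoding steps can be sketched the same way. The toy DataFrame below is purely illustrative (the column names are assumptions, not Mickey's actual schema):

```python
import pandas as pd

# Toy data with missing values (column names are illustrative only)
df = pd.DataFrame({
    "sqft": [1400.0, 1600.0, None, 2000.0],
    "bedrooms": [3.0, None, 2.0, 4.0],
    "location": ["north", None, "south", "north"],
})

# Numerical columns: fill gaps with the column mean
for col in ["sqft", "bedrooms"]:
    df[col] = df[col].fillna(df[col].mean())

# Categorical column: fill gaps with the mode, then one-hot encode
df["location"] = df["location"].fillna(df["location"].mode()[0])
df = pd.get_dummies(df, columns=["location"])

print(df.isna().sum().sum())  # 0 – no missing values remain
```

In a real project, the imputation statistics, like the scaler, should also be computed on the training split only, for the same leakage reasons discussed above.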

Step 2 – Testing Multiple Algorithms

Instead of guessing, we compared models.

  • Linear Regression

  • Random Forest Regressor

Random Forest handled outliers and nonlinear patterns much better.

from sklearn.ensemble import RandomForestRegressor

# 100 trees; fixed seed so the results are reproducible
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

Step 3 – Proper Model Evaluation

We evaluated using:

  • Mean Absolute Error (MAE)

  • R-squared Score

from sklearn.metrics import mean_absolute_error, r2_score

# Evaluate on the held-out test set only
predictions = model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

Step 4: Visualizing Results

After evaluating the model, we helped Mickey visualize predicted vs. actual house prices using Matplotlib. The sample code below shows the approach.

import matplotlib.pyplot as plt

# Points close to the diagonal indicate accurate predictions
plt.scatter(y_test, predictions)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs Predicted Prices")
plt.show()

Results After Optimization

After correcting preprocessing and switching models:

Before Improvement:

  • R² Score: 0.59

  • MAE: High error spread

After Improvement:

  • R² Score: 0.83

  • MAE: Significantly reduced

More importantly, he understood why the improvement happened.

What He Learned During This Project

By the end of the sessions, Mickey was able to:

  • Explain why preprocessing impacts model performance

  • Avoid data leakage

  • Compare regression algorithms properly

  • Interpret performance metrics confidently

  • Create meaningful visualizations

He submitted his project with a clear technical report and received strong academic feedback.

Technical Summary of the Project

  • Problem Type: Supervised Regression

  • Dataset Size: ~2,000 records

  • Algorithms Tested: Linear Regression, Random Forest

  • Final Model: Random Forest Regressor

  • Evaluation Metrics: MAE, R²

  • Libraries Used: Pandas, NumPy, Scikit-learn, Matplotlib

  • Final R² Score: 0.83

Testimonial:

“I finally understood how ML projects actually work. The sessions helped me connect theory with implementation. I felt confident while submitting my project.” – Mickey (USA)

Our Learning Philosophy

When students approach us with complex projects like Machine Learning, our focus is not just code completion.

We guide them through:

  • Understanding the dataset

  • Building models correctly

  • Avoiding common mistakes

  • Writing clean reports

  • Strengthening conceptual clarity

That’s how real learning happens.

Frequently Asked Questions (FAQ)

Why did the model initially show an R² score of 0.99?

The unusually high R² score was caused by data leakage. The dataset was scaled before being split into training and testing sets. This allowed information from the test data to influence the training process, which artificially inflated performance. Once the workflow was corrected, the model showed a realistic R² score of 0.83.

Why did Random Forest perform better than Linear Regression?

Linear Regression assumes a linear relationship between features and the target variable. However, house price data often contains nonlinear patterns and outliers. Random Forest handles such complexity better by combining multiple decision trees, which improved prediction accuracy in this case.
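That kind of comparison can be run side by side. The sketch below uses synthetic stand-in data (the real housing dataset is not shown in this write-up), so the point is the workflow, not the specific scores; on purely linear synthetic data, Linear Regression will actually come out ahead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=500, n_features=6, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the same split and compare test-set R²
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R² = {results[name]:.3f}")
```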

What is data leakage, and why does it matter?

Data leakage happens when information from the test dataset is unintentionally used during training. This makes the model appear more accurate than it actually is. A common cause is applying preprocessing techniques like scaling or encoding before splitting the dataset.
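A tiny experiment makes the difference concrete. Below, the same made-up feature values are scaled the leaky way and the correct way:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # made-up feature values

# Leaky: the scaler sees every row, including the future test rows
X_leaky = MinMaxScaler().fit_transform(X)

# Correct: split first, fit the scaler on the training rows only
X_train, X_test = train_test_split(X, test_size=0.25, shuffle=False)
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_leaky.max())        # 1.0 – test information shaped this range
print(X_test_scaled.max())  # > 1.0 – unseen values can exceed the training range
```

The leaky version looks tidier, but only because the test rows already influenced the scaling range, which is exactly the problem.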

Which metrics are commonly used to evaluate regression models?

For regression problems, common evaluation metrics include:

  • Mean Absolute Error (MAE)

  • Mean Squared Error (MSE)

  • Root Mean Squared Error (RMSE)

  • R-squared (R²)

These metrics help measure how close the model’s predictions are to actual values.
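All four can be computed with scikit-learn and NumPy. The actual and predicted prices below are made-up numbers for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual vs. predicted house prices
y_true = np.array([200_000, 250_000, 300_000, 350_000])
y_pred = np.array([210_000, 240_000, 310_000, 330_000])

mae = mean_absolute_error(y_true, y_pred)   # average absolute error, in price units
mse = mean_squared_error(y_true, y_pred)    # squared errors; penalizes big misses
rmse = np.sqrt(mse)                         # back in price units, easier to read
r2 = r2_score(y_true, y_pred)               # fraction of variance explained

print(mae, rmse, r2)
```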

What are the most common mistakes in beginner ML regression projects?

Some common mistakes include:

  • Not handling missing values properly

  • Scaling before train-test split

  • Using only one algorithm without comparison

  • Ignoring overfitting

  • Misinterpreting evaluation metrics

Understanding these mistakes helps improve both model performance and conceptual clarity.
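The overfitting check in particular is cheap to run: compare the model's score on the data it was trained on against its score on held-out data. A minimal sketch on synthetic stand-in data (not Mickey's dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_regression(n_samples=500, n_features=6, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

# A large gap (e.g. 0.99 train vs. 0.70 test) suggests the model memorized noise
print(f"Train R²: {train_r2:.3f}  Test R²: {test_r2:.3f}")
```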

Need Guidance on Your Machine Learning Project?

If you’re working on regression, classification, or any ML assignment and feel stuck in preprocessing or model selection, structured guidance can make a big difference.