How We Helped a University of Chicago Student Improve His Machine Learning Regression Project

Project Overview

Mickey, a Master’s student in Data Science at the University of Chicago, approached us with just one week left before submitting his Machine Learning project.

The task was to build a regression model that could predict house prices using structured data. The dataset included features like square footage, number of bedrooms, location, and year built.

He knew basic Python, but this was his first real-world ML project in which everything (data cleaning, model selection, evaluation, and reporting) had to come together correctly.

He didn’t want someone to just “do it for him.” He wanted to understand what he was doing.

That’s where we stepped in.

The Actual Problem He Was Facing

When we reviewed his code and dataset, we identified four major issues.

1. Data Preprocessing Confusion

  • Missing values were left untreated.

  • Some categorical columns were not encoded properly.

  • Feature scaling was inconsistent.

He wasn’t sure which preprocessing steps were necessary and why.

2. Model Selection Uncertainty

He had tried Linear Regression, but the results were weak.

R² Score: 0.59
Mean Absolute Error: high, with a wide spread between predicted and actual prices

He wasn’t sure whether to:

  • Improve Linear Regression

  • Switch to Random Forest

  • Try multiple algorithms

3. Improper Train-Test Workflow

He scaled the entire dataset before splitting.

This created data leakage, which inflated performance artificially. This is a common beginner mistake in ML projects.

During one session, his model suddenly showed an R² score of 0.99, which looked impressive at first. But we quickly realized that scaling was done before splitting the dataset. That small mistake caused data leakage and made the model appear more accurate than it actually was.

Once we corrected the workflow and retrained the model properly, the performance stabilized at a realistic R² of 0.83.

 

4. Weak Model Interpretation

He calculated metrics but didn’t fully understand:

  • What MAE actually meant

  • How to interpret R²

  • Whether his model was overfitting

Solution: Our Step-by-Step Approach

Instead of rewriting everything, we worked collaboratively with him. We paired him with one of our ML mentors who had previously worked on regression-based academic projects.

We scheduled structured sessions and rebuilt the workflow properly.

Step 1 – Clean and Prepare the Dataset Correctly

We first focused only on the data.

  • Filled numerical missing values using mean imputation.

  • Used mode for categorical features.

  • Applied encoding where necessary.

  • Performed feature scaling after train-test split.

Example:

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Split first, then scale: the scaler must only ever see training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same scaling for test data

We explained why scaling must only be fit on training data. That clarity changed his understanding completely.
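The imputation and encoding steps can be sketched the same way. The toy DataFrame below is purely illustrative (the column names are assumptions, not Mickey's actual schema):

```python
import pandas as pd

# Toy data with missing values (column names are illustrative only)
df = pd.DataFrame({
    "sqft": [1400.0, 1600.0, None, 2000.0],
    "bedrooms": [3.0, None, 2.0, 4.0],
    "location": ["north", None, "south", "north"],
})

# Numerical columns: fill gaps with the column mean
for col in ["sqft", "bedrooms"]:
    df[col] = df[col].fillna(df[col].mean())

# Categorical column: fill gaps with the mode, then one-hot encode
df["location"] = df["location"].fillna(df["location"].mode()[0])
df = pd.get_dummies(df, columns=["location"])

print(df.isna().sum().sum())  # 0 – no missing values remain
```

In a real project, the imputation statistics, like the scaler, should also be computed on the training split only, for the same leakage reasons discussed above.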

Step 2 – Testing Multiple Algorithms

Instead of guessing, we compared models.

  • Linear Regression

  • Random Forest Regressor

Random Forest handled outliers and nonlinear patterns much better.

from sklearn.ensemble import RandomForestRegressor

# 100 trees; fixed seed so the results are reproducible
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

Step 3 – Proper Model Evaluation

We evaluated using:

  • Mean Absolute Error (MAE)

  • R-squared Score

from sklearn.metrics import mean_absolute_error, r2_score

# Evaluate on the held-out test set only
predictions = model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

Step 4: Visualizing Results

After evaluating the model, we helped Mickey visualize predicted vs. actual house prices using Matplotlib. The sample code below shows the approach.

import matplotlib.pyplot as plt

# Points close to the diagonal indicate accurate predictions
plt.scatter(y_test, predictions)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs Predicted Prices")
plt.show()

Results After Optimization

After correcting preprocessing and switching models:

Before Improvement:

  • R² Score: 0.59

  • MAE: High error spread

After Improvement:

  • R² Score: 0.83

  • MAE: Significantly reduced

More importantly, he understood why the improvement happened.

What He Learned During This Project

By the end of the sessions, Mickey was able to:

  • Explain why preprocessing impacts model performance

  • Avoid data leakage

  • Compare regression algorithms properly

  • Interpret performance metrics confidently

  • Create meaningful visualizations

He submitted his project with a clear technical report and received strong academic feedback.

Technical Summary of the Project

  • Problem Type: Supervised Regression

  • Dataset Size: ~2,000 records

  • Algorithms Tested: Linear Regression, Random Forest

  • Final Model: Random Forest Regressor

  • Evaluation Metrics: MAE, R²

  • Libraries Used: Pandas, NumPy, Scikit-learn, Matplotlib

  • Final R² Score: 0.83

Testimonial:

“I finally understood how ML projects actually work. The sessions helped me connect theory with implementation. I felt confident while submitting my project.” – Mickey (USA)

Our Learning Philosophy

When students approach us with complex projects like Machine Learning, our focus is not just code completion.

We guide them through:

  • Understanding the dataset

  • Building models correctly

  • Avoiding common mistakes

  • Writing clean reports

  • Strengthening conceptual clarity

That’s how real learning happens.

Frequently Asked Questions (FAQ)

Why did the model initially show an R² score of 0.99?

The unusually high R² score was caused by data leakage. The dataset was scaled before being split into training and testing sets. This allowed information from the test data to influence the training process, which artificially inflated performance. Once the workflow was corrected, the model showed a realistic R² score of 0.83.

Why did Random Forest perform better than Linear Regression?

Linear Regression assumes a linear relationship between features and the target variable. However, house price data often contains nonlinear patterns and outliers. Random Forest handles such complexity better by combining multiple decision trees, which improved prediction accuracy in this case.
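That kind of comparison can be run side by side. The sketch below uses synthetic stand-in data (the real housing dataset is not shown in this write-up), so the point is the workflow, not the specific scores; on purely linear synthetic data, Linear Regression will actually come out ahead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=500, n_features=6, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the same split and compare test-set R²
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R² = {results[name]:.3f}")
```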

What is data leakage, and why does it matter?

Data leakage happens when information from the test dataset is unintentionally used during training. This makes the model appear more accurate than it actually is. A common cause is applying preprocessing techniques like scaling or encoding before splitting the dataset.
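A tiny experiment makes the difference concrete. Below, the same made-up feature values are scaled the leaky way and the correct way:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # made-up feature values

# Leaky: the scaler sees every row, including the future test rows
X_leaky = MinMaxScaler().fit_transform(X)

# Correct: split first, fit the scaler on the training rows only
X_train, X_test = train_test_split(X, test_size=0.25, shuffle=False)
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_leaky.max())        # 1.0 – test information shaped this range
print(X_test_scaled.max())  # > 1.0 – unseen values can exceed the training range
```

The leaky version looks tidier, but only because the test rows already influenced the scaling range, which is exactly the problem.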

Which metrics are commonly used to evaluate regression models?

For regression problems, common evaluation metrics include:

  • Mean Absolute Error (MAE)

  • Mean Squared Error (MSE)

  • Root Mean Squared Error (RMSE)

  • R-squared (R²)

These metrics help measure how close the model’s predictions are to actual values.
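All four can be computed with scikit-learn and NumPy. The actual and predicted prices below are made-up numbers for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual vs. predicted house prices
y_true = np.array([200_000, 250_000, 300_000, 350_000])
y_pred = np.array([210_000, 240_000, 310_000, 330_000])

mae = mean_absolute_error(y_true, y_pred)   # average absolute error, in price units
mse = mean_squared_error(y_true, y_pred)    # squared errors; penalizes big misses
rmse = np.sqrt(mse)                         # back in price units, easier to read
r2 = r2_score(y_true, y_pred)               # fraction of variance explained

print(mae, rmse, r2)
```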

What are the most common mistakes in beginner ML regression projects?

Some common mistakes include:

  • Not handling missing values properly

  • Scaling before train-test split

  • Using only one algorithm without comparison

  • Ignoring overfitting

  • Misinterpreting evaluation metrics

Understanding these mistakes helps improve both model performance and conceptual clarity.
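The overfitting check in particular is cheap to run: compare the model's score on the data it was trained on against its score on held-out data. A minimal sketch on synthetic stand-in data (not Mickey's dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_regression(n_samples=500, n_features=6, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

# A large gap (e.g. 0.99 train vs. 0.70 test) suggests the model memorized noise
print(f"Train R²: {train_r2:.3f}  Test R²: {test_r2:.3f}")
```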

Need Guidance on Your Machine Learning Project?

If you’re working on regression, classification, or any ML assignment and feel stuck in preprocessing or model selection, structured guidance can make a big difference.