Linear Regression Tutorial
This notebook explores linear regression, detailing its mathematical foundation and practical application. It guides users through data loading, exploration, model training, and evaluation using metrics like R-squared, MSE, and RMSE. The notebook emphasizes the importance of testing on unseen data to ensure model generalization and avoid overfitting, providing a strong foundation in this fundamental machine learning algorithm.


Welcome to this comprehensive guide on Linear Regression! We'll explore one of the most fundamental and widely-used algorithms in machine learning and statistics.

What is Linear Regression?

Linear regression is a statistical method that models the relationship between:

  • Dependent variable (y): The outcome we want to predict
  • Independent variable (x): The input feature we use to make predictions

The goal is to find the best-fitting straight line through our data points. This line can then be used to make predictions on new, unseen data.

The Mathematical Foundation

Linear regression assumes the relationship between x and y can be expressed as:

y = mx + b + ε

Where:

  • m = slope (how much y changes for each unit change in x)
  • b = y-intercept (value of y when x = 0)
  • ε = error term (the difference between predicted and actual values)
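To make the model concrete, here is a minimal sketch that generates synthetic data from exactly this equation (NumPy assumed; the slope, intercept, and noise level are arbitrary illustration values, not taken from the notebook's dataset):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

m, b = 1.0, -0.1                      # illustrative slope and intercept
x = rng.uniform(0, 100, size=700)     # input feature
eps = rng.normal(0, 3, size=x.size)   # error term ε
y = m * x + b + eps                   # y = mx + b + ε

print(x[:3].round(2), y[:3].round(2))
```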

Why Linear Regression Matters

  • Simple to understand and interpret
  • Fast to train and predict
  • Great baseline for more complex models
  • Provides insights into feature importance
  • Works well when relationships are approximately linear

Let's dive into our dataset and see linear regression in action!

Training Data Summary:
  Number of samples: 700
  X range: 0.0 to 3530.2
  Y range: -3.8 to 108.9

Test Data Summary:
  Number of samples: 300
  X range: 0.0 to 100.0
  Y range: -3.5 to 105.6
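The cell that produced this summary isn't shown, but output like it can come from a few lines of pandas. The sketch below assumes the data lives in train.csv and test.csv with columns named x and y (hypothetical names, not confirmed by the notebook):

```python
import pandas as pd

def summarize(name, df):
    # Print sample count and the min/max of each column
    print(f"{name} Data Summary:")
    print(f"  Number of samples: {len(df)}")
    print(f"  X range: {df['x'].min():.1f} to {df['x'].max():.1f}")
    print(f"  Y range: {df['y'].min():.1f} to {df['y'].max():.1f}")

train = pd.read_csv("train.csv")   # hypothetical filename
test = pd.read_csv("test.csv")     # hypothetical filename
summarize("Training", train)
summarize("Test", test)
```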

Exploring Our Dataset

Before building any model, it's crucial to visualize and understand our data. This helps us:

  • Identify patterns: Does the relationship look linear?
  • Spot outliers: Are there unusual data points?
  • Check data quality: Are there missing values or errors?
  • Understand the scale: What's the range of our variables?

From our initial exploration:

  • Training set: 700 samples with x ranging from 0 to 3530.2, y from -3.8 to 108.9
  • Test set: 300 samples with x ranging from 0 to 100, y from -3.5 to 105.6

Note that the training X maximum (3530.2) sits far outside the test range of 0 to 100, which hints at a possible outlier to watch for in the scatter plot.

Let's visualize the training data to see if there's a linear relationship:
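A simple scatter plot is enough for this. The sketch below assumes matplotlib and the hypothetical train DataFrame from the loading sketch above:

```python
import matplotlib.pyplot as plt

# Scatter plot of the raw training data to eyeball linearity
plt.scatter(train["x"], train["y"], s=8, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Training data")
plt.show()
```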

[Figure: scatter plot of the training data]

Building Our Linear Regression Model

The scatter plot suggests a strong linear relationship between x and y, which makes linear regression a great choice for this dataset.

How Linear Regression Works

Linear regression finds the best-fit line by minimizing the sum of squared errors. This means:

  1. 📊 For each data point, calculate the difference between the actual y value and the predicted y value

  2. ➡️ Square these differences (to make them all positive and penalize larger errors more)

  3. 🎯 Sum all squared differences and find the line that minimizes this total

This process is called Ordinary Least Squares (OLS) and gives us the optimal values for:

  • Slope (m): How steep our line is
  • Intercept (b): Where our line crosses the y-axis
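For simple (one-feature) regression, this minimization has a well-known closed-form solution:

$$
\hat{m} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
\hat{b} = \bar{y} - \hat{m}\,\bar{x}
$$

where $\bar{x}$ and $\bar{y}$ are the sample means of x and y.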

Let's train our model and see what it learns!
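The training cell itself isn't shown, but a minimal sketch of this step, assuming scikit-learn and the hypothetical train/test DataFrames from above, might look like:

```python
from sklearn.linear_model import LinearRegression

# Check for and drop rows with missing values before fitting
print("Training data - Y missing:", train["y"].isna().sum())
train_clean = train.dropna()
test_clean = test.dropna()

# Fit ordinary least squares on the single feature x
model = LinearRegression()
model.fit(train_clean[["x"]], train_clean["y"])

m, b = model.coef_[0], model.intercept_
print(f"Slope (m): {m:.4f}")
print(f"Intercept (b): {b:.4f}")
print(f"Equation: y = {m:.4f}x + {b:.4f}")
```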

Checking for missing values...
  Training data - X missing: 0
  Training data - Y missing: 1
  Test data - X missing: 0
  Test data - Y missing: 0

Original training size: 700, After cleaning: 699
Original test size: 300, After cleaning: 300

🎉 Model Training Complete!

Learned Parameters:
  Slope (m): 1.0007
  Intercept (b): -0.1073
  Equation: y = 1.0007x + -0.1073

Model Interpretation:
  • For every 1 unit increase in x, y increases by 1.0007 units
  • When x = 0, the model predicts y = -0.1073

Evaluating Our Model's Performance

Now that we've trained our model, we need to measure how well it performs. This tells us:

  • How accurate are our predictions?
  • How much error does our model make?
  • How well does our model capture the relationship?

Key Metrics for Linear Regression

1. R-squared (R²) - Coefficient of Determination

  • Measures how much of the variation in y is explained by x
  • Range: 0 to 1 (higher is better); on unseen data it can even go negative if the model does worse than simply predicting the mean
  • 0.8+ is considered good, 0.9+ is excellent

2. Mean Squared Error (MSE)

  • Average of squared differences between actual and predicted values
  • Lower values indicate better performance
  • Units are squared (e.g., if y is in dollars, MSE is in dollars²)

3. Root Mean Squared Error (RMSE)

  • Square root of MSE, so it's in the same units as y
  • Easier to interpret than MSE
  • Tells us the typical prediction error
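In symbols, writing $\hat{y}_i$ for the model's prediction at the $i$-th point and $\bar{y}$ for the mean of the actual values:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}
$$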

Let's calculate these metrics for our model:
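One way to compute them, assuming scikit-learn's metrics module and the fitted model from the sketch above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def report(name, X, y, model):
    # Compute R², MSE, and RMSE for one data split
    pred = model.predict(X)
    mse = mean_squared_error(y, pred)
    print(f"{name} Set Performance:")
    print(f"  R-squared (R²): {r2_score(y, pred):.4f}")
    print(f"  Mean Squared Error: {mse:.4f}")
    print(f"  Root Mean Squared Error: {np.sqrt(mse):.4f}")

report("Training", train_clean[["x"]], train_clean["y"], model)
report("Test", test_clean[["x"]], test_clean["y"], model)
```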

📏 Training Set Performance:
  R-squared (R²): 0.9907
  Mean Squared Error: 7.8678
  Root Mean Squared Error: 2.8050

💡 Model Performance Interpretation:
  • EXCELLENT fit! The model explains 99.1% of the variation in y
  • On average, predictions are off by 2.80 units
  • The relationship is almost perfectly linear (slope ≈ 1.0)!
[Figure: test data with model predictions overlaid]
[Figure: training data with model predictions overlaid]
🎯 Test Set Performance:
  R-squared (R²): 0.9888
  Mean Squared Error: 9.4329
  Root Mean Squared Error: 3.0713

🔄 Training vs Test Comparison:
  R² - Training: 0.9907 | Test: 0.9888 | Difference: 0.0019
  RMSE - Training: 2.8050 | Test: 3.0713 | Difference: 0.2664

✅ EXCELLENT generalization! Performance is very consistent.

📝 Sample Predictions:
  x=77.0 → Actual: 79.78, Predicted: 76.94, Error: 2.83
  x=21.0 → Actual: 23.18, Predicted: 20.91, Error: 2.27
  x=22.0 → Actual: 25.61, Predicted: 21.91, Error: 3.70
  x=20.0 → Actual: 17.86, Predicted: 19.91, Error: 2.05
  x=36.0 → Actual: 41.85, Predicted: 35.92, Error: 5.93