Linear Regression¶
In this section, we further explore linear ML algorithms, but now the task is to predict continuous-valued responses instead of the discrete classes we handled in linear classification. Taking the ML application mentioned in the introduction to ML as an example, we may now want to predict how much of a chemical substance dissolves in water, rather than just whether it dissolves or not (binary classification).
Linear regression focuses on modeling the relationship between input variables (features) and a continuous target variable. It assumes a linear relationship between the input features and the target variable.
Here our goal is again to find the best-fitting line (or hyperplane in higher dimensions) that minimizes the difference between predicted and actual target values. To this end, we will cover:
The least squares criterion for quantifying the training error in linear regression
The stochastic gradient descent (SGD) algorithm, which is used in the training process of a linear regression model
The regularization term for linear regression
The sources of error in linear regression
Empirical Risk Minimization (ERM)¶
Objective function¶
As we saw in the introduction to ML, the goal of ML is to minimize the objective function by adjusting the model’s parameters through techniques (i.e., optimization algorithms) such as gradient descent.
One of the objective functions that we may use in linear regression is the empirical risk ($R_n$). We express the empirical risk in terms of a loss measure, which reflects only the deviation between model predictions and the target values (or labels) of our training dataset, and thus does not consider regularization. The goal of empirical risk minimization (ERM) is to find a model that minimizes the discrepancy between predictions and observations on the training data, with the assumption that it will generalize well to unseen data. We can define $R_n$ as follows:
$$R_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \text{Loss}\left(y^{(t)} - \theta \cdot x^{(t)}\right)$$
where $n$ is the number of training examples, $(x^{(t)}, y^{(t)})$ is the $t$-th training example (feature vector and label, respectively), and $\text{Loss}$ is a generic loss function. Note that $\cdot$ denotes a dot product.
One common way to express deviations between predictions and observations on the training data is to compute the squared error, $\text{Loss}(z) = z^2/2$, which yields the ordinary least squares (OLS) objective function:
$$R_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \frac{\left(y^{(t)} - \theta \cdot x^{(t)}\right)^2}{2}$$
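As a minimal sketch, the empirical risk with squared-error loss can be evaluated in a few lines of NumPy. The function name `empirical_risk`, the toy data, and the convention $\text{Loss}(z) = z^2/2$ (including the factor of 1/2) are assumptions made for illustration.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average squared-error loss, assuming Loss(z) = z**2 / 2."""
    residuals = y - X @ theta            # y^(t) - theta . x^(t) for every example
    return np.mean(residuals ** 2 / 2)

# Hypothetical toy data: 3 training examples with 2 features each.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 4.0])
theta = np.array([1.0, 2.0])

# Predictions are [1, 2, 3], so only the last example contributes a residual of 1.
risk = empirical_risk(theta, X, y)
print(risk)
```

Here the three residuals are 0, 0 and 1, so the risk is $(0 + 0 + 1/2)/3 = 1/6$.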
Know more
Squaring the deviations between model predictions and label values to use as a loss function is a common practice in optimization problems for several reasons:
Simplicity: Squaring the deviations simplifies the mathematical formulation of the loss function. It eliminates the need to consider the direction of the deviation (positive or negative) and ensures that all deviations contribute positively to the loss. Additionally, squaring preserves the nice mathematical properties required for optimization, such as being differentiable and convex.
Emphasizing large errors: Squaring the deviations amplifies the impact of larger errors compared to smaller errors, so the loss function penalizes significant deviations more severely, which can be desirable in many applications. The flip side is that a few outliers can dominate the loss and pull the fitted model towards them.
Differentiability: Squaring the deviations makes the loss function differentiable, which is crucial for optimization algorithms that rely on gradients to update the model parameters. The ability to compute derivatives allows efficient optimization using gradient-based methods like gradient descent or stochastic gradient descent. These methods iteratively adjust the model parameters in the direction that minimizes the loss.
Convexity: Squared loss is a convex function of the model parameters, meaning that any local minimum is also a global minimum. Convexity simplifies the optimization process because gradient-based algorithms cannot get stuck in a suboptimal local minimum and, with a suitable learning rate, converge reliably to a global minimum. The minimizer is not necessarily unique, though (for example, when some features are linearly dependent). Non-convex loss functions may have multiple local minima, which can make optimization more challenging.
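The first two properties in the list above can be checked numerically. This is a small sketch with made-up deviations, again assuming the loss $z^2/2$.

```python
import numpy as np

# Hypothetical deviations between labels and predictions.
deviations = np.array([-4.0, -1.0, 0.5, 1.0, 4.0])
squared_loss = deviations ** 2 / 2

# The sign of the deviation does not matter, and large deviations dominate.
for z, loss in zip(deviations, squared_loss):
    print(f"deviation={z:5.1f}  loss={loss:6.3f}")
```

A deviation of 4 costs 16 times as much as a deviation of 1, which is the amplification of large errors described above.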
Learning algorithm¶
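Now, we will use the stochastic gradient descent (SGD) algorithm to update our model parameters $\theta$. Recall that we do this by adjusting $\theta$ with the gradient of our objective function, i.e., the empirical risk, evaluated at one training example at a time. Thus, we nudge $\theta$ in the direction opposite to the gradient $\nabla_\theta$. Note that the empirical risk $R_n(\theta)$ above, defined with the squared error as the loss function, is differentiable everywhere. We compute the gradient of the loss on the $t$-th training example, which yields:
$$\nabla_\theta \frac{\left(y^{(t)} - \theta \cdot x^{(t)}\right)^2}{2} = -\left(y^{(t)} - \theta \cdot x^{(t)}\right) x^{(t)}$$
Thus, we can summarize our learning algorithm as:
Initialize $\theta = 0$
Randomly pick $t \in \{1, \dots, n\}$
Update $\theta$, so that:
$$\theta \leftarrow \theta + \eta \left(y^{(t)} - \theta \cdot x^{(t)}\right) x^{(t)}$$
where $\eta$ is the learning rate.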
Note that this learning algorithm is very similar to the one for the case of linear classification.
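The three steps of the learning algorithm can be sketched in NumPy as follows. The function name `sgd_linear_regression`, the fixed number of steps, the constant learning rate, and the noise-free synthetic data are all assumptions chosen to keep the example short; the loss is assumed to be $z^2/2$.

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, n_steps=5000, seed=0):
    """SGD on the squared-error loss (y - theta . x)**2 / 2, one example per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                       # step 1: initialize theta = 0
    for _ in range(n_steps):
        t = rng.integers(n)                   # step 2: randomly pick an example
        error = y[t] - theta @ X[t]           # deviation on that example
        theta = theta + eta * error * X[t]    # step 3: move against the gradient
    return theta

# Hypothetical noise-free data generated from a known parameter vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta

theta_hat = sgd_linear_regression(X, y)
print(theta_hat)
```

Because the labels here are exactly linear in the features, the iterates approach `true_theta`. With noisy labels and a constant learning rate, the estimate would instead fluctuate around the least-squares solution.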
Regularization: Ridge regression¶
Objective function¶
So far, our optimization problem for training a linear regression model has focused only on minimizing the training error (empirical risk minimization, or ERM). However, a regularization term is crucial in most cases; otherwise, our model may fail to generalize to datasets other than the training dataset at hand. Thus, we now add a regularization term to our objective function, which turns it into a ridge regression problem. Ridge regression adds a regularization term, often called the “ridge penalty” or “L2 penalty”, to the ordinary least squares (OLS) objective function. This penalty term ($\frac{\lambda}{2}\|\theta\|^2$) controls the complexity of the model by shrinking $\theta$ (i.e., the regression coefficients) towards zero. Thus, the objective function for ridge regression is:
$$J_{n,\lambda}(\theta) = \frac{\lambda}{2}\|\theta\|^2 + R_n(\theta)$$
where $\lambda$ is the regularization parameter we covered in Linear Classification.
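A minimal sketch of the ridge objective, assuming the penalty is written as $\frac{\lambda}{2}\|\theta\|^2$ and the loss as $z^2/2$. The function name `ridge_objective` and the toy data are hypothetical.

```python
import numpy as np

def ridge_objective(theta, X, y, lam):
    """L2 penalty plus empirical risk with squared-error loss z**2 / 2."""
    penalty = lam / 2 * np.dot(theta, theta)      # (lambda / 2) * ||theta||^2
    risk = np.mean((y - X @ theta) ** 2 / 2)      # empirical risk R_n(theta)
    return penalty + risk

# Hypothetical toy data: 3 training examples with 2 features each.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 4.0])
theta = np.array([1.0, 2.0])

objective = ridge_objective(theta, X, y, lam=0.1)
print(objective)
```

For these numbers the penalty is $0.05 \times 5 = 0.25$ and the empirical risk is $1/6$. Setting `lam=0` recovers the plain empirical risk.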
Learning algorithm¶
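As we did in the Empirical Risk Minimization (ERM) method, we can also apply the stochastic gradient descent algorithm in ridge regression, only now we need to take the gradient of the new objective function ($J_{n,\lambda}$) and use it to update $\theta$ at each iteration through the training dataset.
Let’s first expand all terms of $J_{n,\lambda}$:
$$J_{n,\lambda}(\theta) = \frac{\lambda}{2}\|\theta\|^2 + \frac{1}{n} \sum_{t=1}^{n} \frac{\left(y^{(t)} - \theta \cdot x^{(t)}\right)^2}{2}$$
The gradient, evaluated at a single training example $t$, can now be computed as:
$$\nabla_\theta \left[\frac{\lambda}{2}\|\theta\|^2 + \frac{\left(y^{(t)} - \theta \cdot x^{(t)}\right)^2}{2}\right] = \lambda\theta - \left(y^{(t)} - \theta \cdot x^{(t)}\right) x^{(t)}$$
Thus, we can summarize our learning algorithm as:
Initialize $\theta = 0$
Randomly pick $t \in \{1, \dots, n\}$
Update $\theta$, so that:
$$\theta \leftarrow \theta - \eta\left[\lambda\theta - \left(y^{(t)} - \theta \cdot x^{(t)}\right) x^{(t)}\right] = (1 - \eta\lambda)\,\theta + \eta \left(y^{(t)} - \theta \cdot x^{(t)}\right) x^{(t)}$$
where $\eta$ is the learning rate.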
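The ridge version of the learning algorithm differs from the unregularized one only in the shrinkage factor $(1 - \eta\lambda)$ applied to $\theta$ at every step. The sketch below assumes the penalty $\frac{\lambda}{2}\|\theta\|^2$ and the loss $z^2/2$; the function name `sgd_ridge`, the hyperparameter values, and the synthetic data are illustrative choices.

```python
import numpy as np

def sgd_ridge(X, y, lam=0.1, eta=0.01, n_steps=5000, seed=0):
    """SGD on lam/2 * ||theta||^2 + (y - theta . x)**2 / 2, one example per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                                        # initialize theta = 0
    for _ in range(n_steps):
        t = rng.integers(n)                                    # randomly pick an example
        error = y[t] - theta @ X[t]
        theta = (1 - eta * lam) * theta + eta * error * X[t]   # shrink, then correct
    return theta

# Hypothetical noise-free data generated from a known parameter vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0])

theta_plain = sgd_ridge(X, y, lam=0.0)   # no penalty: same as the ERM update
theta_ridge = sgd_ridge(X, y, lam=1.0)   # strong penalty: coefficients are shrunk
print(np.linalg.norm(theta_plain), np.linalg.norm(theta_ridge))
```

With `lam=0.0` the update reduces to the unregularized rule, while a larger `lam` yields a parameter vector with a smaller norm.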
Note that by adding a regularization term to our objective function, we are now looking for a model that, rather than fitting the training data perfectly, is able to generalize to other datasets as well. We do so because we believe that the model should not be adjusted to every single piece of weak evidence or noise contained in the training dataset. Instead, we introduce the regularization parameter $\lambda$, which keeps $\theta$ from changing unless the evidence is strong enough to be worth an increase in $\|\theta\|$. As the value of $\lambda$ increases, so does the training error, but with the hope that our model will generalize better, yielding a lower test error.
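The effect of $\lambda$ on the training error can be illustrated with a small experiment. To avoid sampling noise from SGD, this sketch swaps in the exact minimizer of the ridge objective, obtained by setting its gradient to zero; that closed-form solve is not part of the algorithm described above and is used here only for illustration. The data, the values of $\lambda$, and the function names are assumptions.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Exact minimizer of lam/2 * ||theta||^2 + mean((y - X @ theta)**2 / 2)."""
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X / n + lam * I) theta = X^T y / n.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def training_risk(theta, X, y):
    """Empirical risk with squared-error loss z**2 / 2 (no penalty term)."""
    return np.mean((y - X @ theta) ** 2 / 2)

# Hypothetical noisy data generated from a linear model.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=30)

lams = [0.0, 0.1, 1.0, 10.0]
thetas = [ridge_closed_form(X, y, lam) for lam in lams]
risks = [training_risk(theta, X, y) for theta in thetas]
norms = [np.linalg.norm(theta) for theta in thetas]

for lam, r, nm in zip(lams, risks, norms):
    print(f"lambda={lam:5.1f}  training risk={r:.4f}  ||theta||={nm:.4f}")
```

As $\lambda$ grows, the training risk goes up and $\|\theta\|$ goes down. Whether the test error improves depends on the data and has to be checked on held-out examples.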
Structural vs. estimation error¶
When selecting an ML algorithm, we make certain assumptions about the relationship between the features and the labels. In the case of linear regression, the assumption is that the relationship between the features and the labels can be represented by a linear equation. If this assumption is violated, such as when the true relationship is nonlinear, then our model will have a high structural error because it cannot accurately capture the underlying patterns in the data. Thus, structural error reflects the limitations or assumptions of the chosen model, and it represents the part of the error that cannot be eliminated within that model class, regardless of the amount of training data. Estimation error, on the other hand, arises from the finite nature of the training data and the resulting inability of our model to find the best parameters from that data. Estimation errors occur when the available training data is limited or does not adequately represent the true underlying distribution of the problem. In such cases, the model may struggle to capture the true patterns and relationships present in the data, leading to higher estimation errors.
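The two sources of error can be contrasted in a small simulation. This is a sketch under several assumptions: a one-feature model $y \approx \theta x$ without offset, a quadratic ground truth for the structural case, Gaussian label noise for the estimation case, and hypothetical function names and sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_least_squares(x, y):
    """Least-squares slope for the one-feature model y = theta * x (no offset)."""
    return np.dot(x, y) / np.dot(x, x)

def risk(theta, x, y):
    """Empirical risk with squared-error loss z**2 / 2."""
    return np.mean((y - theta * x) ** 2 / 2)

# Structural error: the true relationship is quadratic, so even with a very
# large training set the best linear model leaves a substantial error.
x_big = rng.uniform(-1, 1, size=100_000)
y_quadratic = x_big ** 2
theta_q = fit_least_squares(x_big, y_quadratic)
structural_risk = risk(theta_q, x_big, y_quadratic)

# Estimation error: the true relationship is linear, but the estimate of theta
# is unreliable when it is based on only a few noisy examples.
def mean_abs_error(n, trials=200, true_theta=2.0):
    errors = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, size=n)
        y = true_theta * x + rng.normal(size=n)
        errors.append(abs(fit_least_squares(x, y) - true_theta))
    return np.mean(errors)

err_small_n = mean_abs_error(5)
err_large_n = mean_abs_error(2000)

print(f"structural risk with 100000 examples: {structural_risk:.3f}")
print(f"mean |theta_hat - theta| with n=5:    {err_small_n:.3f}")
print(f"mean |theta_hat - theta| with n=2000: {err_large_n:.3f}")
```

More data shrinks the estimation error in the second experiment, while the risk in the first experiment stays well above zero no matter how many examples are used, because the linear model class cannot represent the quadratic relationship.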