Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Linear Classification

Linear Classification

In this section, we will cover the fundamentals of linear classification through a simple ML algorithm, the Perceptron. In addition, we will extend the concepts behind the Perceptron algorithm by considering aspects of regularization to build a margin linear classifier.

Hyperplanes

Suppose we wish to classify positive and negative objects from the training set of points below (figure on the left):

decisionbound

Figure 1:Training set of points with binary labels (+1, -1) and two-dimensional (x1,x2)(x_1, x_2) features. The decision boundary (grey line) is defined by the parameter vector θ\theta, which is normal to the decision boundary, and offset parameter θ0\theta_0 that linearly separates the data.

The dataset above is considered linearly separable because it exists at least one linear decision boundary capable of splitting the entire dataset correctly. For instance, we could pass a decision boundary like the grey line above (figure on the right)

In this case, since the features (x1,x2)R2(x_1, x_2) \in \mathbb{R}^2 , that is, the feature set belongs to the two-dimensional space, the decision boundary constitutes a line. If we were dealing with a set of features in the three-dimensional space (x1,x2,x3)(x_1, x_2, x_3), the decision boundary would be a plane. Analogously, if our feature set were in a higher-dimensional space, the decision boundary would constitute a hyperplane.

A hyperplane with dd dimensions is conventionally denoted by the vector normal to the plane, θRd\theta \in \mathbb{R}^d, and offset (scalar) parameter θ0\theta_0. In the example above, we would define the hyperplane (or decision boundary) as:

θX+θ0=0[θ1θ2][x1x2]+θ0=0\theta \cdot X + \theta_0 =0 \equiv \begin{bmatrix} \theta_1 & \theta_2 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \theta_0 = 0

Our classifier h(x,θ,θ0)h(x, \theta, \theta_0) is thus equal to sign(θX+θ0)sign(\theta \cdot X + \theta_0) where θR2\theta \in \mathbb{R}^2 and θ0R\theta_0 \in \mathbb{R}. Recall the sign function, also known as the signum function, is a mathematical function that returns the sign or direction of a real number. That is, if the input number is positive, negative, or 0, the sign function returns +1, -1, or 0, respectively.

Perceptron algorithm

In the perceptron algorithm, we typically initialize θ\theta as zero (zero vector) and loop through the pair of training examples. At every iteration, we will check if the classifier makes a mistake classifying that training example (i-th example), and if so, then we update the parameters of θ\theta.

Assume that θ0=0\theta_0 =0 for simplicity (the decision boundary must pass through the origin). Our perceptron classifier will make a mistake if y(i)(θx(i))0y^{(i)}(\theta \cdot x^{(i)}) \leq 0. We will then update our θ\theta to no longer misclassify that training example. The way to do this is by adding y(i)x(i)y^{(i)}x^{(i)} to the previous θ\theta. Thus, the update would look like:

θ=θ+y(i)x(i)\theta = \theta + y^{(i)}x^{(i)}

We have in hands a set of different training examples which have the potential to nudge/update our classifier in many directions. Thus, it is possible and even expected that the last training examples cause updates that will overwrite earlier, initial updates. This will result that earlier examples will no longer be correctly classified. For this reason, we need to loop through the whole training set multiple TT times to ensure that all examples are correctly classified. Such iterations can be performed both in order or randomly selected from the training examples.

We can code the algorithm as following:

import numpy as np


# Algorithm always starting to loop from x1
def perceptron(X, y, theta, t_times):
    n_mistakes = 0

    # Initialize list to show the progress (updates) of theta
    progress_theta = []

    # Initialize an array with same size as the total number of examples to count how many mistake are made at each training example
    explicit_mistakes = np.zeros(shape=y.shape[0])

    # Loop through the training set T times
    for t in t_times:

        # Loop through the training examples in order
        for index, x in enumerate(X):

            # Check if the algorithm makes a mistake in the i-th (or index-th) example
            if y[index] * np.dot(theta, x) <= 0:
                # Update theta to no longer misclassify the i-th example
                theta = theta + y[index] * x

                # Save the update theta
                progress_theta.append(theta)

                # Update total number of mistakes
                n_mistakes += 1

                # Update total number of mistakes at the i-th training example
                explicit_mistakes[index] += 1
    print('The perceptron did {} mistakes until convergence'.format(n_mistakes))
    return progress_theta, n_mistakes, explicit_mistakes


if __name__ == '__main__':
    X = np.array([[-1, -1], [1, 0], [-1, 1.5]])
    # X = np.array([[-1, -1], [1, 0], [-1, 10]])

    y = np.array([1, -1, 1])

    t_times = range(0, 100)

    theta = np.array([-1, -1])

    a, b, c = perceptron(X, y, theta, t_times)

Margin boundaries and hinge loss

As you may have noticed, the perceptron algorithm features no regularization term. The goal was simply to find any decision boundary that can split the data correctly. Here, we will introduce the concept of hinge loss and margin boundaries to transform the problem of learning a decision boundary into an optimization problem considering regularization.

Motivation behind margin boundaries

Let’s take a look at our previously presented training dataset (figure below). Any decision boundary within the dashed grey lines correctly splits the training examples. However, intuitively, we would like to favor a decision boundary that can maximize the distances between the decision boundary and the training points. The reason to do so is because it’s likely that the points we wish to classify in the future have statistical noise, such that a decision boundary located too close to the training examples is more likely to misclassify slightly changed (noisier) versions of the training examples. In contrast, a classifier that holds a relatively higher margin between the decision boundary and the examples will be probably more successful in classifying future, unseen data.

marginbound

Figure 3:Training set of points with binary labels (+1, -1) in the two-dimensional feature space (x1,x2)(x_1, x_2). Any decision boundary within the dashed grey lines can correctly split the data.

Optimization problem

Recall that our goal is to find a linear classifier that maximizes the distances between the decision boundary and the training points (margin linear classifier), but also minimizes the training error. Thus, this constitutes an optimization problem that needs to counterbalance these two factors, which we can state as:

Margin boundaries

Previously, we saw that the equation defining a decision boundary satisfies θX+θ0=0\theta \cdot X + \theta_0 =0.

We can now define parallel margin boundaries (dashed line in the previous figure) as:

θX+θ0=±1\theta \cdot X + \theta_0 = \pm 1

Note that we can define the boundaries like this because we have a degree of freedom in our definition of the decision boundary, namely, the magnitude of the normal vector θ\| \theta \|. That is, regardless of the value θ\| \theta \|, our decision boundary remains unaltered.

Recall the problem of computing the smallest distance of a point to a plane. This distance is:

θx(i)+θ0θ\frac{\theta \cdot x^{(i)} + \theta_0 }{\| \theta \|}

We can now calculate the signed distance between the decision boundary and the i-th example as:

γi(θ,θ0)=θx(i)+θ0θ\gamma_i (\theta, \theta_0) = \frac{\theta \cdot x^{(i)} + \theta_0 }{\| \theta \|}

Thus, the distance between the margin boundaries and the decision boundary is:

γi(θ,θ0)=1θ\gamma_i (\theta, \theta_0) = \frac{1}{\| \theta \|}

Hinge loss

We so far know that sign(θx(i)+θ0)sign(\theta \cdot x^{(i)} + \theta_0 ) will classify the i-th example. The way to know if the classification agrees with the label is by multiplying it by y(i)y^{(i)}. We can express this agreement also in a slightly modified version, using the hinge loss:

Lossh(z)={=0    if    z1=1z    if    <1Loss_h(z)= \begin{cases} = 0 \;\; \mbox{if} \;\; z \geq 1 \\ =1-z \;\; \mbox{if} \;\;< 1\end{cases}

where zz is the agreement (signed distance from the decision boundary) y(i)(θx(i)+θ0)y^{(i)}(\theta \cdot x^{(i)}+\theta_0).

The figure below illustrates how the hinge loss operates along the z-axis (distance from boundary):

hinge-loss-function

Objective function

So now we can create an objective function that (1) minimizes the average hinge loss over the training examples, and (2) maximizes 1θ\frac{1}{\| \theta \|}. Expression (2) can be also reformulated towards minimizing 12θ2\frac{1}{2}\| \theta \|^2. Thus we define the objective function as:

C(θ,θ0)=1ni=1nLossh(y(i)(θx(i)+θ0))+λ2θ2C(\theta, \theta_0) = \frac{1}{n}\sum_{i=1}^n Loss_h(y^{(i)}(\theta \cdot x^{(i)}+\theta_0)) + \frac{\lambda}{2} \| \theta \|^2

where λ\lambda is the regularization parameter that balances the importance of minimizing the regularization term λ2θ2\frac{\lambda}{2}\| \theta \|^2 at the cost of incurring more losses (increasing the loss term). Vice versa, the smaller the value of λ\lambda, the more emphasis we will give to minimizing average loss.