Kernels

In this section, we cover the fundamental concepts of non-linear classification by introducing kernels. First, let us recall what we have seen so far in our section about Linear Classification. There, our task was to classify data points with a hyperplane that linearly separates the dataset in the feature space. For instance, in a 3d feature space, with feature vectors $(x_1, x_2, x_3) \in \mathbb{R}^3$, the data is linearly separable if there is at least one plane (not a line) that splits the points by class. Unlike linear classification, which assumes a linear relationship between input features and class labels, non-linear classification algorithms use various techniques to capture complex patterns and decision boundaries in the data. In particular, we will look at how to transform our data into a new coordinate space of higher dimension through kernels, which helps us turn the non-linear problem into a linear one.

Kernels allow us to transform data into a higher-dimensional feature space where linear separation becomes possible. One example of an ML algorithm that relies on kernels to find complex patterns and decision boundaries in the data is the Support Vector Machine (SVM).

Feature transformation

We will now see how feature transformation works through a 1d example, that is, we have a single feature $x \in \mathbb{R}$. The figure below illustrates the training points ($n=3$).

Note from the figure that the dataset is not linearly separable, at least not in the given 1d feature space. To turn this into a linear problem, we can apply a feature transformation $\Phi(x)$ and look for a decision boundary in a higher-dimensional space. In this particular example, we can transform the 1d feature into a new 2d feature vector, where the additional dimension can be seen as a new feature.

$$
x \to \Phi(x) = [\Phi_1 \; \; \; \Phi_2] = [x \; \; \; x^2]
$$
[Figure panels: (1) Original feature space; (2) New feature space; (3) Decision boundary in the new feature space]

Figure 1: Training dataset in the initial feature space.
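To make the mapping $x \to [x, x^2]$ concrete, here is a minimal Python sketch. The three training points are hypothetical values chosen for illustration (the actual points appear only in the figure), and the name `feature_map` is an assumption, not part of the original notes.

```python
def feature_map(x):
    """Map a scalar feature x to the 2d feature vector [x, x^2]."""
    return [x, x ** 2]


# Hypothetical 1d training points (illustrative only, not read from the figure).
xs = [-1.0, 0.0, 1.0]

# Each 1d point becomes a 2d point lying on the parabola Phi_2 = Phi_1^2.
transformed = [feature_map(x) for x in xs]

for x, phi in zip(xs, transformed):
    print(f"x = {x:+.1f}  ->  Phi(x) = {phi}")
```

In this sketch the two outer points are lifted to height 1 along the new axis while the middle point stays at 0, which is what allows a straight line to separate them in the new space.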

By performing the feature transformation illustrated in step 2 (training dataset in the new feature space $\Phi(x)$; see the figure above), we can find a classifier $h(x, \theta, \theta_0)$ whose decision boundary is defined by the parameter vector $\theta$ and the offset parameter $\theta_0$:

$$
\begin{aligned}
h(x, \theta, \theta_0) &= \operatorname{sign}(\theta \cdot \Phi(x) + \theta_0) \\
\therefore \; h(x, \theta, \theta_0) &= \operatorname{sign}(\theta_1 x + \theta_2 x^2 + \theta_0)
\end{aligned}
$$
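The classifier $h(x, \theta, \theta_0) = \operatorname{sign}(\theta_1 x + \theta_2 x^2 + \theta_0)$ can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the three labelled points, the hand-picked parameters $\theta = (0, 1)$ and $\theta_0 = -0.5$ (not learned from data), and the convention that a score of exactly 0 is labelled $-1$.

```python
def feature_map(x):
    """Phi(x) = [x, x^2], the transformation used in this section."""
    return (x, x * x)


def sign(z):
    """Return +1 for positive scores, -1 otherwise (assumed tie-breaking at 0)."""
    return 1 if z > 0 else -1


def h(x, theta, theta_0):
    """Linear classifier in the transformed space: sign(theta . Phi(x) + theta_0)."""
    phi = feature_map(x)
    score = sum(t * p for t, p in zip(theta, phi)) + theta_0
    return sign(score)


# Hypothetical labelled points (x, label): the pattern +, -, + along the line
# cannot be produced by a single threshold on x alone.
points = [(-1.0, 1), (0.0, -1), (1.0, 1)]

# Hand-picked parameters (an assumption): the boundary is x^2 = 0.5.
theta = (0.0, 1.0)
theta_0 = -0.5

predictions = [h(x, theta, theta_0) for x, _ in points]
print(predictions)
```

With these parameters the decision boundary is the horizontal line $\Phi_2 = 0.5$ in the new space, which in the original 1d space corresponds to the two thresholds $x = \pm\sqrt{0.5}$. The classifier is linear in $\Phi(x)$ but non-linear in $x$.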

More coming soon...