HiveBrain v1.2.0

Kernelization trick for neural networks

Submitted by: @import:stackexchange-cs

Problem

I've been learning about neural networks and SVMs. The tutorials I've read have emphasized how important kernelization is for SVMs. Without a kernel function, an SVM is just a linear classifier. With kernelization, SVMs can also incorporate non-linear features, which makes them a more powerful classifier.

It looks to me like one could also apply kernelization to neural networks, but none of the tutorials on neural networks I've seen have mentioned this. Do people commonly use the kernel trick with neural networks? I presume someone must have experimented with it to see if it makes a big difference. Does kernelization help neural networks as much as it helps SVMs? Why or why not?

(I can imagine several ways to incorporate the kernel trick into neural networks. One way would be to use a suitable kernel function to preprocess the input, a vector in $\mathbb{R}^n$, into a higher-dimensional input, a vector in $\mathbb{R}^{m}$ for $m\ge n$. For multiple-layer neural nets, another alternative would be to apply a kernel function at each level of the neural network.)
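
The first idea above (kernel-style preprocessing) can be made concrete by applying an explicit feature map before the network, rather than keeping the kernel implicit. Here is a minimal numpy sketch of a degree-2 polynomial feature map; the helper name `poly2_features` is hypothetical, not from any library:

```python
import numpy as np

def poly2_features(X):
    """Explicit degree-2 polynomial feature map: R^n -> R^{n + n(n+1)/2}.
    This is (up to constant factors) the feature space implicitly used by the
    kernel k(x, z) = (x . z)^2, written out so an ordinary network can consume
    the expanded vectors directly as input."""
    n_samples, n = X.shape
    # original coordinates plus all pairwise products x_i * x_j for i <= j
    pairs = [X[:, i] * X[:, j] for i in range(n) for j in range(i, n)]
    return np.hstack([X] + [p[:, None] for p in pairs])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Phi = poly2_features(X)   # shape (2, 5): [x1, x2, x1*x1, x1*x2, x2*x2]
```

The output `Phi` would then be fed to the network in place of `X`. Note this only works for finite-dimensional feature maps; the implicit, possibly infinite-dimensional embedding is exactly what explicit preprocessing cannot reproduce.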

Solution

I think you might be mixing up the terminology in a way that is making the issue confusing. SVMs work by defining a linear decision boundary, i.e., a hyperplane. We can define this hyperplane entirely in terms of inner products between the points. Therefore, if we define this inner product to be one taken in some high-dimensional, or even infinite-dimensional, space, what looks like a hyperplane in that new space is not necessarily linear in the original feature space. So everything is still linear; the only thing we've done is to implicitly (via the new inner product) embed the points in some higher-dimensional space. Maybe you already know all this.
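
To make the "everything via inner products" point concrete, here is a minimal numpy sketch of a kernelized decision function of the standard form f(x) = Σᵢ αᵢ yᵢ k(xᵢ, x) + b (the function names and the toy values are illustrative, not from the original post):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: an inner product in an infinite-dimensional space."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, gamma=1.0):
    """Kernelized decision function: f(x) = sum_i alpha_i * y_i * k(x_i, x) + b.
    The boundary f(x) = 0 is a hyperplane in the kernel's feature space but is
    generally a curved surface in the original input space."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

# With a single support vector equal to x itself, k(x, x) = 1, so f(x) = 1*1*1 + 0
f = svm_decision(np.array([0.0, 0.0]),
                 [np.array([0.0, 0.0])], [1.0], [1.0], b=0.0)  # -> 1.0
```

Note that `x` only ever appears inside kernel evaluations, never on its own; that is the property the kernel trick relies on.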

There are two issues to consider with respect to neural networks. The first was brought up by @Yuval Filmus: because of the hidden layer, neural networks depend on more than just the inner products between the points. If you remove the hidden layer, you have something like logistic regression, of which there are kernelized versions. Maybe there is a way to get around this, but I don't see it.
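
For the hidden-layer-free case mentioned above, a kernelized version does exist because the predictor can be written purely in terms of kernel evaluations. A minimal sketch of the prediction step of kernel logistic regression (representer-theorem form; the function name and toy data are illustrative):

```python
import numpy as np

def kernel_logistic_predict(x, X_train, alphas, gamma=1.0):
    """Kernel logistic regression prediction: p(y=1|x) = sigmoid(sum_i alpha_i * k(x_i, x)).
    Everything the model needs to know about x is captured by kernel evaluations
    against the training points -- exactly the property a hidden layer breaks."""
    k = np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))  # RBF kernel vector
    return 1.0 / (1.0 + np.exp(-(alphas @ k)))

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
alphas = np.zeros(2)   # all-zero coefficients give p = sigmoid(0) = 0.5
p = kernel_logistic_predict(np.array([0.0, 0.0]), X_train, alphas)  # -> 0.5
```

(The coefficients `alphas` would normally be fit by regularized maximum likelihood; only the kernelized prediction form is shown here.)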

Secondly, you mention preprocessing the input by projecting it into a higher-dimensional, but not infinite-dimensional, space. Neural networks define a decision surface, and this surface is not constrained to be linear. This means the gain from projecting the points into a higher-dimensional space is different: it may make it easier to find a good set of weights, but it does not necessarily make the model any more powerful. This follows from the universal approximation theorem, which tells us that, given a large enough number of hidden units, we can approximate any function (under some restrictions). That last statement is rather vacuous, and I somewhat hate to mention it: since it tells you nothing about how to find the right weights, it doesn't bring much to the table from an application perspective.
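
As a small illustration of why the network itself supplies the non-linearity, here is a one-hidden-layer ReLU network that represents the non-linear function |x| exactly with hand-set weights, no kernel preprocessing involved (the weights are a constructed example, not learned):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def one_hidden_layer(x, W1, b1, w2, b2):
    """A single-hidden-layer network: x -> w2 . relu(W1 @ x + b1) + b2."""
    return w2 @ relu(W1 @ x + b1) + b2

# Two hidden units represent |x| exactly, since |x| = relu(x) + relu(-x).
W1 = np.array([[1.0], [-1.0]])  # hidden units compute x and -x
b1 = np.zeros(2)
w2 = np.array([1.0, 1.0])       # sum the two rectified units
y = one_hidden_layer(np.array([-3.0]), W1, b1, w2, 0.0)  # -> 3.0
```

A linear model (kernelized or not) in the raw one-dimensional input could never produce this V-shaped surface; the hidden layer creates it directly.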

Context

StackExchange Computer Science Q#16220, answer score: 6

Revisions (0)

No revisions yet.