patternMinor
SVM optimization objective: why are we maximizing $\frac{1}{\|w\|}$?
Viewed 0 times
whyfracaresvmmaximizingoptimizationobjective
Problem
I'm learning about Support Vector Machines(SVM).
I understood that their objective is to maximize the margin between the decision line and the closest point to it.
For simplicity, let's assume we have a linear SVM and linearly separable data X.
We would like to find the maximum distance D and vector W, such that
$\dfrac{W^T \cdot X_i+W_0}{\|W\|}\geq D$ for all $X_i$ in the first class
$\dfrac{W^T \cdot X_i+W_0}{\|W\|}\leq-D$ for all $X_i$ in the second class.
Why is the problem equivalent to maximizing $\frac{1}{\|W\|}$, given the following constraints are satisfied:
$W^T\cdot X_i+W_0\geq1$ for all $X_i$ in the first class
$W^T\cdot X_i+W_0\leq-1$ for all $X_i$ in the second class.
I'm more interested in a formal proof than in the intuition.
I understood that their objective is to maximize the margin between the decision line and the closest point to it.
For simplicity, let's assume we have a linear SVM and linearly separable data X.
We would like to find the maximum distance D and vector W, such that
$\dfrac{W^T \cdot X_i+W_0}{\|W\|}\geq D$ for all $X_i$ in the first class
$\dfrac{W^T \cdot X_i+W_0}{\|W\|}\leq-D$ for all $X_i$ in the second class.
Why is the problem equivalent to maximizing $\frac{1}{\|W\|}$, given the following constraints are satisfied:
$W^T\cdot X_i+W_0\geq1$ for all $X_i$ in the first class
$W^T\cdot X_i+W_0\leq-1$ for all $X_i$ in the second class.
I'm more interested in a formal proof than in the intuition.
Solution
Let's slightly rewrite the condition you give: the quantity
$$ \frac{W^T \cdot X_i + W_0}{\|W\|} \mp D $$
is non-negative for $X_i$ in the first class (with $-D$), non-positive for $X_i$ in the second class (with $+D$). Now suppose $\lambda > 0$. Then the quantity
$$ \frac{(\lambda W)^T \cdot X_i + (\lambda W_0)}{\|\lambda W\|} \mp \lambda D $$
satisfies the same properties as before; in fact the two sets of inequalities are equivalent. Therefore we can assume without loss of generality that $\|W\| = 1$ (choose $\lambda = 1/\|W\|$), and so the quantity in question becomes
$$ W^T \cdot X_i + W_0 \mp D, $$
where we are promised that $\|W\| = 1$. Now divide everything by $D$ to get another quantity satisfying the same properties:
$$ (D^{-1}W)^T \cdot X_i + (D^{-1}W_0) \mp 1. $$
Maximizing $D$ is the same as minimizing $D^{-1} = \| D^{-1} W \|$, which is the same as maximizing $1/\|D^{-1} W \|$. Now imagine that instead of $W,W_0$, you were optimizing over $W'=D^{-1}W,W'_0=D^{-1}W_0$; for each choice of $W',W'_0$ you can recover $D = 1/\|W'\|$ and so $W,W_0$, which shows that both problems are equivalent. In these terms, you want the following quantity to satisfy the required properties:
$$ W'^T \cdot X_i + W'_0 \mp 1. $$
Under these conditions, you want to maximize $1/\|W'\|$.
$$ \frac{W^T \cdot X_i + W_0}{\|W\|} \mp D $$
is non-negative for $X_i$ in the first class (with $-D$), non-positive for $X_i$ in the second class (with $+D$). Now suppose $\lambda > 0$. Then the quantity
$$ \frac{(\lambda W)^T \cdot X_i + (\lambda W_0)}{\|\lambda W\|} \mp \lambda D $$
satisfies the same properties as before; in fact the two sets of inequalities are equivalent. Therefore we can assume without loss of generality that $\|W\| = 1$ (choose $\lambda = 1/\|W\|$), and so the quantity in question becomes
$$ W^T \cdot X_i + W_0 \mp D, $$
where we are promised that $\|W\| = 1$. Now divide everything by $D$ to get another quantity satisfying the same properties:
$$ (D^{-1}W)^T \cdot X_i + (D^{-1}W_0) \mp 1. $$
Maximizing $D$ is the same as minimizing $D^{-1} = \| D^{-1} W \|$, which is the same as maximizing $1/\|D^{-1} W \|$. Now imagine that instead of $W,W_0$, you were optimizing over $W'=D^{-1}W,W'_0=D^{-1}W_0$; for each choice of $W',W'_0$ you can recover $D = 1/\|W'\|$ and so $W,W_0$, which shows that both problems are equivalent. In these terms, you want the following quantity to satisfy the required properties:
$$ W'^T \cdot X_i + W'_0 \mp 1. $$
Under these conditions, you want to maximize $1/\|W'\|$.
Context
StackExchange Computer Science Q#15012, answer score: 7
Revisions (0)
No revisions yet.