On Support Vector Machines


Though the term support vector machine sounds pretty fancy, the basic idea behind it is pretty simple. For simplicity, let us assume that we have a binary classification problem with a dataset $\mathcal{D} = \{(x^{(i)},y^{(i)})\}$ where $x^{(i)}\in \mathbb{R}^n$ and $y^{(i)}\in \{-1,+1\}$. Our aim is to find a somehow 'optimal' hyperplane, the high-dimensional analogue of a plane, that separates positive and negative datapoints sufficiently well. To understand how we find that optimal hyperplane, let us work in $n$-dimensional Euclidean space.
Here we can define the dot product between any two vectors, $\cdot:\mathbb{R}^n\times \mathbb{R}^n\to \mathbb{R}$, as follows:
\[
x\cdot y = \sum_{i=1}^n x_iy_i
\]
where $x = (x_1,x_2,\ldots,x_n)$ and $y=(y_1,y_2,\ldots,y_n)$. Let us fix some vector $w$ and look at dot products of the form:
\[
w\cdot x, \text{ where } \Vert x \Vert =1
\]
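To make the definition concrete, here is a minimal sketch in plain Python (no libraries) that directly transcribes the coordinate-wise sum above; the vectors are just made-up examples:

```python
# Direct transcription of the dot product formula: sum of coordinate-wise products.
def dot(x, y):
    return sum(x_i * y_i for x_i, y_i in zip(x, y))

print(dot([1.0, 2.0, 3.0], [4.0, -5.0, 6.0]))  # 1*4 + 2*(-5) + 3*6 = 12.0
```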
One might ask when such a product is maximized or minimized. Intuitively (and as the short numerical check after the list below confirms), it is clear that:


  • $w\cdot x$ is maximized when $x=kw$ for $k>0$, i.e. $x$ is aligned with $w$.
  • $w\cdot x=0$ when $x\perp w$.
  • $w\cdot x$ is minimized when $x=-kw$ for $k>0$, i.e. $x$ points in the direction opposite to $w$.
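Here is that quick numerical check of the three cases, a small sketch assuming NumPy is available; the vector $w=(3,4)$ is an arbitrary made-up choice:

```python
import numpy as np

w = np.array([3.0, 4.0])                       # an arbitrary fixed vector, ||w|| = 5
unit = lambda v: v / np.linalg.norm(v)         # rescale to unit length, so ||x|| = 1

cases = {
    "aligned":    unit(w),                     # x points in the same direction as w
    "orthogonal": unit(np.array([-4.0, 3.0])), # x is perpendicular to w
    "opposite":   -unit(w),                    # x points away from w
}
for name, x in cases.items():
    print(name, np.dot(w, x))
# aligned    5.0   (the maximum, equal to ||w||)
# orthogonal 0.0
# opposite  -5.0   (the minimum)
```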


In this respect, the dot product can be used to measure similarity in orientation between any two vectors. A large positive value indicates that the two vectors are well aligned, while a large negative value indicates that they point in roughly opposite directions. Now let us look at all $x$ satisfying the following equation:
\[
w\cdot x = b
\]
Those points form a hyperplane. Points on one side of the hyperplane satisfy $w\cdot x < b$, while points on the opposite side satisfy $w\cdot x > b$. So to define a hyperplane we need to set values for $w$ and $b$. How should we find them? Suppose that our datapoints are linearly separable, i.e. we assume the existence of a hyperplane separating the positive from the negative datapoints. Then to classify the points correctly we require:
\[
w\cdot x^{(i)} - b \ge 1, \text{ if } y^{(i)}=1
\]
\[
w\cdot x^{(i)}- b \le -1,\text{ if } y^{(i)}=-1
\]
The reason we have $\pm1$ in the above inequalities is to ensure some kind of confidence margin. Indeed, we would not wish to end up with a case like $w\cdot x - b = 0.01$, where it is hard to judge whether we should classify the datapoint as a positive or a negative example. The above conditions can be summarized as follows:
\[
y^{(i)}(w\cdot x^{(i)}-b)\ge 1,\, \forall i
\]
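As a sanity check, here is a small sketch, assuming NumPy and a made-up toy dataset, that tests whether a candidate pair $(w,b)$ satisfies this condition for every datapoint:

```python
import numpy as np

# A made-up, linearly separable toy dataset: two positive and two negative points.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A candidate hyperplane w.x = b.
w = np.array([1.0, 1.0])
b = 0.0

margins = y * (X @ w - b)      # y_i (w . x_i - b) for every datapoint i
print(margins)                 # [4. 6. 4. 4.]
print(np.all(margins >= 1))    # True: (w, b) classifies every point with margin at least 1
```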
There might be several $(w,b)$ pairs satisfying the above condition. How do we choose the optimal one among them? The idea is to use a max-min principle: we wish to maximize the minimum distance between the datapoints and the hyperplane. So we look at the datapoints $x$ satisfying $w\cdot x - b = \pm 1$. These are the so-called support vectors, to which the algorithm owes its fancy name. Observe that the distance between the hyperplane $w\cdot x = b$ and each of the hyperplanes $w\cdot x = b\pm 1$ is $\frac{1}{\Vert w \Vert}$. We wish to find the $(w,b)$ pair for which this distance is greatest, so we need to look for $w$ such that $\Vert w \Vert$ is smallest. Finally, our objective can be summarized as follows:
\[
\min_{w,b} \Vert w \Vert
\]
\[
\text{subject to } y^{(i)}(w\cdot x^{(i)}-b)\ge 1, \, \forall i
\]
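In practice this constrained problem is handed to an off-the-shelf solver. As an illustration only, here is a sketch using scikit-learn's SVC with a linear kernel on the same made-up toy data; a very large C is used to approximate the hard-margin objective above (scikit-learn actually solves the soft-margin variant):

```python
import numpy as np
from sklearn.svm import SVC

# The same made-up toy dataset as before.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# kernel="linear" keeps the plain dot product; a huge C approximates the hard margin.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]
b = -clf.intercept_[0]             # scikit-learn writes w.x + intercept, so b = -intercept
print(w, b)                        # the learned hyperplane parameters
print(clf.support_vectors_)        # the support vectors, lying on w.x - b = ±1
print(np.sign(X @ w - b))          # recovered labels: [ 1.  1. -1. -1.]
```

Note that such solvers minimize $\frac{1}{2}\Vert w\Vert^2$ rather than $\Vert w\Vert$ itself, which has the same minimizer under the same constraints and is easier to optimize.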
