Lagrange Duality and SVM


Several posts ago we talked about SVMs, and in particular about how to formulate the SVM objective as an optimization problem. In this post we are going to learn about Lagrange duality, an optimization technique that will help us solve the problem we ended up with last time. Let us start with a simpler optimization problem:
\[
\min_w\, f(w)
\]
subject to constraints:
\[
h_i(w) = 0, \quad i=1,\ldots, l.
\]
The classical way to solve such a problem is to write a Lagrangian function:
\[
L(w,\beta) = f(w) + \sum_{i=1}^l \beta_i h_i(w)
\]
and set
\[
\frac{\partial L}{\partial w_i} = 0, \quad \frac{\partial L}{\partial \beta_i}=0
\]
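For instance, consider the toy problem (made up purely for illustration) $f(w) = w_1^2 + w_2^2$ with the single constraint $h_1(w) = w_1 + w_2 - 1 = 0$. The Lagrangian is
\[
L(w,\beta) = w_1^2 + w_2^2 + \beta_1(w_1 + w_2 - 1),
\]
and setting the partial derivatives to zero gives
\[
2w_1 + \beta_1 = 0, \quad 2w_2 + \beta_1 = 0, \quad w_1 + w_2 - 1 = 0,
\]
so $w_1 = w_2 = \frac{1}{2}$ and $\beta_1 = -1$.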
This works because at a critical point of $L(w,\beta)$ the condition $\frac{\partial L}{\partial \beta_i} = 0$ forces $h_i(w)=0$ for every $i$, so feasibility is built into the stationarity conditions. Now let us consider a more complex optimization problem:
\[
\min f(w)
\]
subject to constraints:
\[
g_i(w) \le 0, \quad i = 1,\ldots,k
\]
and
\[
h_j(w) = 0,\quad j = 1,\ldots,l
\]
Let us call the above problem the 'primal' problem. We write the generalized Lagrangian function as follows:
\[
L(w,\alpha,\beta) = f(w) + \sum_{i=1}^k \alpha_i g_i(w) + \sum_{j=1}^l \beta_j h_j(w)
\]
and let
\[
\theta_P(w) = \max_{\alpha,\beta;\alpha_i\ge 0}L(w,\alpha,\beta)
\]
Let us observe that if some constraint is violated, then $\theta_P(w) = \infty$: a violated $g_i(w) > 0$ lets us drive $\alpha_i \to \infty$, and a violated $h_j(w) \neq 0$ lets us drive $\beta_j h_j(w) \to \infty$. On the other hand, if all constraints are satisfied, the best the maximization can do is make $\alpha_i g_i(w) = 0$, so $\theta_P(w) = f(w)$ and therefore:
\[
\min_w f(w) = \min_w \theta_P(w) = \min_w \max_{\alpha,\beta;\alpha_i\ge 0}L(w,\alpha,\beta)
\]
Let us now look at a different quantity:
\[
\theta_D(\alpha,\beta) = \min_w L(w,\alpha,\beta)
\]
Now consider the dual optimization problem:
\[
\max_{\alpha,\beta;\alpha_i\ge 0}\theta_D(\alpha,\beta) = \max_{\alpha,\beta;\alpha_i\ge 0} \min_w L(w,\alpha,\beta)
\]
It always holds (weak duality) that:
\[
d^* = \max_{\alpha,\beta;\alpha_i\ge 0} \min_w L(w,\alpha,\beta) \le \min_w \max_{\alpha,\beta;\alpha_i\ge 0} L(w,\alpha,\beta) = p^*
\]
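To see why, note that for any fixed $w$ and any $\alpha,\beta$ with $\alpha_i \ge 0$,
\[
\theta_D(\alpha,\beta) = \min_{w'} L(w',\alpha,\beta) \le L(w,\alpha,\beta) \le \max_{\alpha',\beta';\,\alpha'_i\ge 0} L(w,\alpha',\beta') = \theta_P(w),
\]
and taking the maximum over $(\alpha,\beta)$ on the left-hand side and then the minimum over $w$ on the right-hand side preserves the inequality.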
Under certain conditions, we have that $d^* = p^*$ (strong duality). In particular, suppose that:


  • $f$ and the $g_i$ are convex;
  • the $h_j$ are affine;
  • the constraints $g_i$ are strictly feasible, i.e. there exists a $w$ such that $g_i(w)<0$ for all $i$.


Under the above conditions, there exist $w^*, \alpha^*, \beta^*$ such that $w^*$ is a solution of the primal problem, while $(\alpha^*,\beta^*)$ is a solution of the dual problem. Furthermore, $p^* = d^* = L(w^*, \alpha^*, \beta^*)$, and $w^*, \alpha^*, \beta^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:
\begin{align*}
\frac{\partial}{\partial w_i}L(w^*, \alpha^*, \beta^*) = 0,\quad i = 1,\ldots,n\\
\frac{\partial}{\partial \beta_j} L(w^*,\alpha^*, \beta^*) = 0,\quad j=1,\ldots,l\\
\alpha_i^* g_i(w^*) = 0, \quad i = 1,\ldots,k\\
g_i(w^*)\le 0, \quad i = 1,\ldots, k\\
\alpha_i^* \ge 0, \quad i = 1,\ldots, k
\end{align*}
Moreover, if some $(w^*, \alpha^*, \beta^*)$ satisfies the KKT conditions, then it is a solution to the primal and dual problems.
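As a one-dimensional illustration (again a made-up toy problem, not the SVM one), take $f(w) = \frac{1}{2}w^2$ with the single inequality constraint $g(w) = 1 - w \le 0$. The Lagrangian is $L(w,\alpha) = \frac{1}{2}w^2 + \alpha(1-w)$, and
\begin{align*}
\theta_D(\alpha) &= \min_w L(w,\alpha) = \alpha - \frac{1}{2}\alpha^2 \quad \text{(attained at $w = \alpha$)},\\
d^* &= \max_{\alpha\ge 0}\theta_D(\alpha) = \frac{1}{2} \quad \text{(attained at $\alpha^* = 1$)},
\end{align*}
which equals the primal optimum $p^* = \frac{1}{2}$ at $w^* = 1$. The KKT conditions indeed hold: $\frac{\partial L}{\partial w}(w^*,\alpha^*) = w^* - \alpha^* = 0$, $\alpha^* g(w^*) = 0$, $g(w^*) \le 0$, and $\alpha^* \ge 0$.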

Application of Lagrange Duality to SVMs

Recall the SVM optimization problem we stopped at:
\[
\min_{w,b} \frac{1}{2}\Vert w \Vert^2
\]
subject to constraints:
\[
y^{(i)}(w^Tx^{(i)} + b)\ge 1, \quad i = 1, \ldots, m
\]
Let us rewrite the constraints as
\[
g_i(w) = 1 - y^{(i)}(w^Tx^{(i)}+b)\le 0, \quad i = 1,\ldots,m
\]
So the Lagrangian function is
\[
L(w,b,\alpha) = \frac{1}{2}\Vert w \Vert^2 - \sum_{i=1}^m \alpha_i(y^{(i)}(w^Tx^{(i)}+b)-1)
\]
Computing gradients, we have:
\begin{align*}
\nabla_wL = w - \sum_{i=1}^m\alpha_i y^{(i)}x^{(i)} = 0 \\
\frac{\partial L}{\partial b} = - \sum_{i=1}^m \alpha_i y^{(i)} = 0
\end{align*}
From these equations we obtain

  • $w = \sum_{i=1}^m \alpha_i y^{(i)}x^{(i)}$
  • $\sum_{i=1}^m \alpha_i y^{(i)} = 0$

Putting these identities back, we obtain:
\[
L(w,b,\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i \alpha_j \langle x^{(i)},x^{(j)}\rangle
\]
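To see this, substitute $w = \sum_{i=1}^m \alpha_i y^{(i)}x^{(i)}$ into both terms of the Lagrangian and use $\sum_{i=1}^m \alpha_i y^{(i)} = 0$:
\begin{align*}
\frac{1}{2}\Vert w \Vert^2 &= \frac{1}{2}\sum_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle,\\
\sum_{i=1}^m \alpha_i\left(y^{(i)}(w^Tx^{(i)}+b)-1\right) &= \sum_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle + b\sum_{i=1}^m \alpha_i y^{(i)} - \sum_{i=1}^m \alpha_i\\
&= \sum_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)},x^{(j)}\rangle - \sum_{i=1}^m \alpha_i,
\end{align*}
and subtracting the second expression from the first gives exactly the formula above.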
The dual problem looks as follows:
\[
\max_{\alpha} \left[ W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i \alpha_j \langle x^{(i)},x^{(j)}\rangle \right]
\]
subject to constraints
\[
\alpha_i\ge 0, \quad i = 1, \ldots, m
\]
and
\[
\sum_{i=1}^m \alpha_i y^{(i)} = 0
\]
Once we have found the optimal $\alpha$, we can recover $w = \sum_{i=1}^m \alpha_i y^{(i)}x^{(i)}$. To find $b$, we observe that:
\[
b = - \frac{\max_{i:\,y^{(i)}=-1}\langle w^*, x^{(i)} \rangle + \min_{i:\,y^{(i)}=1}\langle w^*,x^{(i)}\rangle}{2}
\]
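This formula comes from the fact that the support vectors on both sides lie exactly on the margin: for positive support vectors $\langle w^*, x^{(i)}\rangle + b = 1$ and for negative ones $\langle w^*, x^{(i)}\rangle + b = -1$, so
\[
\min_{i:\,y^{(i)}=1}\langle w^*, x^{(i)}\rangle = 1 - b, \qquad \max_{i:\,y^{(i)}=-1}\langle w^*, x^{(i)}\rangle = -1 - b,
\]
and adding these two identities and solving for $b$ gives the expression above.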
To make a prediction at a new point $x$, we need to calculate $w^Tx + b$:
\begin{align*}
w^Tx + b &= \Big\langle \sum_i \alpha_i y^{(i)}x^{(i)},\, x \Big\rangle + b\\
&= \sum_i \alpha_i y^{(i)}\langle x^{(i)},x \rangle + b
\end{align*}
Observe that, by the KKT complementarity condition $\alpha_i^* g_i(w^*) = 0$, we have $\alpha_i = 0$ whenever $x^{(i)}$ is not a support vector, so non-support vectors can be ignored in the sum above.
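To make the whole pipeline concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available; the toy data and variable names are mine, not from the original derivation) that solves the dual with a generic constrained optimizer, recovers $w$ and $b$, and predicts using only the support vectors:

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data; labels must be +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
m = len(y)

K = X @ X.T                      # Gram matrix of inner products <x_i, x_j>

def neg_dual(alpha):
    # W(alpha) = sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>
    ya = alpha * y
    return -(alpha.sum() - 0.5 * ya @ K @ ya)

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * m                             # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(m), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x

# Recover w and b from the dual solution.
w = (alpha * y) @ X
sv = np.where(alpha > 1e-6)[0]                         # indices of support vectors
b = -(np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w)) / 2

def predict(x_new):
    # Prediction uses only inner products with the support vectors.
    return np.sign(sum(alpha[i] * y[i] * (X[i] @ x_new) for i in sv) + b)

print("support vectors:", sv, "prediction:", predict(np.array([2.0, 3.0])))
```

Here a general-purpose solver is used only for illustration; dedicated SVM solvers such as SMO handle the dual much more efficiently in practice.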

In conclusion, let us remark that the dual form of the original optimization problem sheds light on the structure of the problem and allows us to write everything in terms of inner products between input points. One can then replace these inner products with more general functions, namely kernels. This is, however, a matter for another post.

References:
1. Andrew Ng, CS229 Lecture Notes, Part V, Support Vector Machines.
