patternModerate
Showing that Bayes classifier is optimal
Viewed 0 times
showingoptimalbayesclassifierthat
Problem
Consider domain $X$, label set $ Y=\{0,1\}$ and the zero-one loss.
Given any probability distribution D over $ X\times \{0,1\} $, we've defined the Bayes classifier $ f_D $ to be-
$$ f_{D}(x)=
\begin{cases}
1 & \text{if }\mathbb{P}[y=1|x]\geq\tfrac{1}{2}\\
0 & \text{otherwise.}
\end{cases} $$
I wish to prove that, for any classifer $ g\colon X\rightarrow\{0,1\}$,
$ L_D(f_D)\leq L_D(g)$, which means that $ f_D$ is optimal.
$L_D(h) $ is defined to be the "true error" of the classifier $h$. That is, $L_D(h)=D\{(x,y)\mid h(x)\not = y\}$.
I'm having some hard time proving this given the definitions above, and some hints/intuition will be appreciated.
Given any probability distribution D over $ X\times \{0,1\} $, we've defined the Bayes classifier $ f_D $ to be-
$$ f_{D}(x)=
\begin{cases}
1 & \text{if }\mathbb{P}[y=1|x]\geq\tfrac{1}{2}\\
0 & \text{otherwise.}
\end{cases} $$
I wish to prove that, for any classifer $ g\colon X\rightarrow\{0,1\}$,
$ L_D(f_D)\leq L_D(g)$, which means that $ f_D$ is optimal.
$L_D(h) $ is defined to be the "true error" of the classifier $h$. That is, $L_D(h)=D\{(x,y)\mid h(x)\not = y\}$.
I'm having some hard time proving this given the definitions above, and some hints/intuition will be appreciated.
Solution
The true error of a classifier $h$ is
$$
\begin{align*}
L_D(h) &= \sideset{\mathbb{E}}{}{}_{x,y \sim D} \Pr[h(x) \neq y] \\ &=
\sideset{\mathbb{E}}{}{}_{x,y \sim D}
\begin{cases}
\Pr[y \neq 0|x] & \text{if } h(x) = 0, \\
\Pr[y \neq 1|x] & \text{if } h(x) = 1.
\end{cases}
\end{align*}
$$
(All probabilities are with respect to $D$.)
The optimal classifier is thus the one that minimizes the loss function
$$
\phi(x) = \begin{cases}
\Pr[y \neq 0|x] & \text{if } h(x) = 0, \\
\Pr[y \neq 1|x] & \text{if } h(x) = 1
\end{cases}
$$
for all $x$. We can rewrite the loss function as
$$
\phi(x) = \begin{cases}
\Pr[y = 1|x] & \text{if } h(x) = 0, \\
1-\Pr[y = 1|x] & \text{if } h(x) = 1
\end{cases}
$$
So if $\Pr[y=1|x] 1-\Pr[y=1|x]$ we should choose $h(x) = 1$; if $\Pr[y=1|x] = 1-\Pr[y=1|x]$ then the choice doesn't matter.
Finally, $\Pr[y=1|x] < 1-\Pr[y=1|x]$ is the same as $\Pr[y=1|x] < 1/2$, and this explains the formula for $f_D(x)$.
$$
\begin{align*}
L_D(h) &= \sideset{\mathbb{E}}{}{}_{x,y \sim D} \Pr[h(x) \neq y] \\ &=
\sideset{\mathbb{E}}{}{}_{x,y \sim D}
\begin{cases}
\Pr[y \neq 0|x] & \text{if } h(x) = 0, \\
\Pr[y \neq 1|x] & \text{if } h(x) = 1.
\end{cases}
\end{align*}
$$
(All probabilities are with respect to $D$.)
The optimal classifier is thus the one that minimizes the loss function
$$
\phi(x) = \begin{cases}
\Pr[y \neq 0|x] & \text{if } h(x) = 0, \\
\Pr[y \neq 1|x] & \text{if } h(x) = 1
\end{cases}
$$
for all $x$. We can rewrite the loss function as
$$
\phi(x) = \begin{cases}
\Pr[y = 1|x] & \text{if } h(x) = 0, \\
1-\Pr[y = 1|x] & \text{if } h(x) = 1
\end{cases}
$$
So if $\Pr[y=1|x] 1-\Pr[y=1|x]$ we should choose $h(x) = 1$; if $\Pr[y=1|x] = 1-\Pr[y=1|x]$ then the choice doesn't matter.
Finally, $\Pr[y=1|x] < 1-\Pr[y=1|x]$ is the same as $\Pr[y=1|x] < 1/2$, and this explains the formula for $f_D(x)$.
Context
StackExchange Computer Science Q#72295, answer score: 11
Revisions (0)
No revisions yet.