patternMinor
Linear Regression. Why do we take sum of errors to compute the cost?
Problem
When computing the cost in linear regression (with a single feature or multiple features, trained using gradient descent), we square the error for each training example and sum the results. What I do not understand is the point of summing the errors over all training examples.
Why not just calculate the cost for each training example individually?
(I am following the Machine Learning course taught by Andrew Ng on Coursera.)
Solution
You can usually view the cost function as the average squared error over a dataset of $N$ input-output pairs $(x_i, y_i)$:
\begin{align}
J &= \frac{1}{N} \sum_{i=1}^{N} \left(f(x_i,\beta) - y_i \right)^2
\end{align}
We want the model's average error over all the data we have to decrease as we tune $\beta$, the parameter vector that defines how the model $f(\cdot,\cdot)$ behaves. Looking at all the data at once lets each change to $\beta$ improve the model as a whole rather than its fit to any single example.
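As a minimal sketch of this idea, the following computes $J$ and runs full-batch gradient descent on it. The single-feature model `beta[0] + beta[1] * x`, the toy data, the learning rate, and the step count are all assumptions chosen for illustration; they do not come from the course. Note that the gradient of the average cost is just the average of the per-example gradients, so every example gets a say in each update.

```python
def predict(x, beta):
    """Assumed linear model f(x, beta) = beta[0] + beta[1] * x (single feature)."""
    return beta[0] + beta[1] * x


def cost(xs, ys, beta):
    """Average squared error J over all N pairs."""
    n = len(xs)
    return sum((predict(x, beta) - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient(xs, ys, beta):
    """Gradient of J with respect to beta: the average of per-example gradients."""
    n = len(xs)
    g0 = sum(2 * (predict(x, beta) - y) for x, y in zip(xs, ys)) / n
    g1 = sum(2 * (predict(x, beta) - y) * x for x, y in zip(xs, ys)) / n
    return [g0, g1]


def batch_gradient_descent(xs, ys, beta, learning_rate=0.05, steps=2000):
    """Repeatedly move beta against the gradient computed on the full dataset."""
    beta = list(beta)
    for _ in range(steps):
        g = gradient(xs, ys, beta)
        beta = [b - learning_rate * gi for b, gi in zip(beta, g)]
    return beta


# Toy data generated from y = 1 + 2x (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

beta = batch_gradient_descent(xs, ys, [0.0, 0.0])
print("cost before:", cost(xs, ys, [0.0, 0.0]))
print("cost after: ", cost(xs, ys, beta))
print("fitted beta:", beta)
```

On this toy data the fitted `beta` should end up close to `[1, 2]`, the values used to generate it.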
If we instead based the cost function on only a subset of the dataset, each update could modify $\beta$ sub-optimally: it improves the fit to that subset, not necessarily to the whole dataset. This can make it take longer to reach a minimum of the cost function, or even cause divergence from a good solution if you are not careful with your hyperparameters (such as the learning rate).
Stochastic gradient descent and mini-batch gradient descent are methods that do exactly this: they use a single example or a small subset of the dataset for each adjustment. They are useful for really large datasets, where needing more updates to converge is a worthwhile trade for not having to make a full pass over the data to compute each gradient and cost.
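The subset-based updates can be sketched as follows. This is an illustrative version only: the single-feature linear model, the toy data, the learning rate, the epoch count, and the function name `minibatch_gradient_descent` are assumptions, not part of the course material. Setting `batch_size=1` gives stochastic gradient descent, and setting it to the dataset size recovers the full-batch update.

```python
import random


def predict(x, beta):
    """Assumed linear model f(x, beta) = beta[0] + beta[1] * x (single feature)."""
    return beta[0] + beta[1] * x


def minibatch_gradient_descent(xs, ys, beta, batch_size=1,
                               learning_rate=0.01, epochs=2000, seed=0):
    """Update beta using the average gradient of one small batch at a time."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    beta = list(beta)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient over the current batch only, not the whole dataset.
            g0 = sum(2 * (predict(xs[i], beta) - ys[i]) for i in batch) / len(batch)
            g1 = sum(2 * (predict(xs[i], beta) - ys[i]) * xs[i] for i in batch) / len(batch)
            beta = [beta[0] - learning_rate * g0, beta[1] - learning_rate * g1]
    return beta


# Toy data generated from y = 1 + 2x (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

sgd_beta = minibatch_gradient_descent(xs, ys, [0.0, 0.0], batch_size=1)
mini_beta = minibatch_gradient_descent(xs, ys, [0.0, 0.0], batch_size=2)
print("SGD (batch_size=1):       ", sgd_beta)
print("mini-batch (batch_size=2):", mini_beta)
```

Because this toy data is noise-free, both variants should approach the generating values `[1, 2]`; with noisy data the single-example updates would typically keep jittering around the minimum unless the learning rate is reduced over time.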
Context
StackExchange Computer Science Q#69994, answer score: 3