Applicability of Machine Learning on Images.
Problem
I am trying to use machine learning to detect and locate an object in a greyscale image. I have about 700 images which are 360x360 pixels in size.
(Example images omitted.) In the first image, the object is indicated by a red arrow; in the second, the object is absent. Additionally, the object can occur anywhere along the X-axis (labelled theta), with a very slight bias towards the central pixels.
To reduce the number of features, I collapsed each image to a 1D array, like so:

    array = numpy.sum(numpy.absolute(image), axis=0)

This preserves the information crucial to my objectives (detecting and locating): the resulting 1D array is brightest at the indices where the object is located, and shows no noticeable brightness when the object is absent. I built a training set of 500 samples x 360 features, with 200 further samples reserved for cross-validation, and trained an SVM on it. The cross-validation accuracy is about 66%, and it doesn't seem to change with the number of training samples.
So I have four questions for the community:

- Does the outlined problem have a reasonable solution in machine learning, or is it absurd to apply machine learning here at all?
- I can look for more images, or even synthesise new images from existing ones (by shifting an image left or right). Can I expect this to improve the results?
- Given the number of features I have, roughly how many samples would I typically need (as a very rough guess, given the kind of images I'm dealing with)?
- The digit-recognition problem has a known issue relevant to my case: algorithms perform poorly when the digits are written off-centre. The pattern I am looking for in the 1D array (a bump) is likewise independent of location, so no particular group of indices is consistently associated with it. Does this hurt my approach?
Solution
It looks plausible that this is the kind of problem machine learning will be good at. I suspect you might be able to achieve much higher accuracy, but the only way to know for sure is to try it.
For your current approach, the size of your training set seems reasonable, assuming you're using a linear SVM: a linear SVM on these features has about 360 parameters, and your training set is larger than that.
There are many things you could try to see if they improve accuracy. Here are a few.
Translation invariance. The properties of an object don't depend on where it is located. However, the way you're currently training the SVM doesn't take this into account, so you might be missing an opportunity to do a lot better.
One potential solution: first, take all of the positive samples in your training set (the ones that do contain an object) and shift them left or right so the object is in the same position in each (always exactly in the middle). Take each negative sample and shift it by a random amount. Then train the SVM on the modified training set. This builds an SVM model that recognizes images containing an object at the centre. Once you've got a trained SVM model, given a new image, try all possible shifts of it and apply the SVM classifier to each; if any of them says "yes", you've found an object.
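The centre-then-scan idea above can be sketched with numpy. The classifier is left abstract here: `classifier_predict`, standing in for a trained SVM's predict function, is a hypothetical name introduced for this sketch, and the cyclic shifts via `np.roll` are an assumption (in practice edge pixels would wrap around).

```python
import numpy as np

def center_on_peak(sample):
    """Shift a 1D sample so its brightest index lands at the middle.

    Assumes the object shows up as the global maximum of the collapsed
    array; negative samples would instead be rolled by a random offset.
    """
    mid = len(sample) // 2
    return np.roll(sample, mid - int(np.argmax(sample)))

def contains_object(sample, classifier_predict):
    """Apply a 'centred-object' classifier to every cyclic shift of sample.

    classifier_predict is any callable mapping a 1D array to True/False,
    e.g. a trained SVM's predict wrapped in a lambda (hypothetical here).
    """
    return any(classifier_predict(np.roll(sample, k)) for k in range(len(sample)))
```

Scanning all 360 shifts per test image is cheap at this size; for larger inputs you might scan a coarser stride first.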
A different potential solution that might work better: construct features that are inherently translation-invariant, by definition. For instance, instead of using the 360 elements of the 1D array as a feature, use the average of the array, or the standard deviation. Then train a classifier on those few features.
You can probably design some other, more complex translation-invariant features. For instance, you could try computing the average or standard deviation of the first derivative of the 1D array (the average or standard deviation of $A[i+1]-A[i]$, over all $i$). Or, rather than doing this on the reduced 1D array, even better, do it on the 2D array: compute some statistics of the first derivative (in the horizontal direction) of the 2D image, and similarly for the first derivative (in the vertical direction). You could also build a histogram of the intensity distribution of the original 2D image (e.g., 10 buckets; the bucket for 0.2-0.3 counts the number of pixels whose intensity is in the range $[0.2,0.3)$). You could grid the 2D image into 90x90 non-overlapping cells, where each cell is a 4x4 pixel square, compute the average intensity of each cell, and then compute a histogram of those values. For instance, this might capture the fact that when there is an object there will be many cells that are all-white, whereas an image with no object probably will have few or no such cells. Try different ideas along those lines, and you might find some that are effective.
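A sketch of a few such translation-invariant features in numpy, combining the ideas above (the bin count, cell size, and function name are illustrative choices, not prescriptions):

```python
import numpy as np

def invariant_features(image, n_bins=10, cell=4):
    """A few shift-invariant summary features for a 2D greyscale image."""
    col = np.sum(np.abs(image), axis=0)   # the 1D collapse from the question
    d1 = np.diff(col)                     # first derivative along theta
    # Histogram of per-cell mean intensities over non-overlapping
    # cell x cell squares (trims any remainder rows/columns).
    h, w = image.shape
    cells = image[:h - h % cell, :w - w % cell].reshape(
        h // cell, cell, w // cell, cell)
    cell_means = cells.mean(axis=(1, 3)).ravel()
    hist, _ = np.histogram(cell_means, bins=n_bins)
    return np.concatenate([[col.mean(), col.std(), d1.mean(), d1.std()],
                           hist / hist.sum()])
```

The mean and standard deviation of the collapsed array are exactly invariant under horizontal shifts; the derivative statistics and cell histogram are approximately so (boundary effects aside).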
Convolutional neural networks. By reducing to a 1D array, you've thrown away a lot of information. You could try applying a CNN classifier directly to the 2D image. There has been a lot of work on this in the computer vision community over the past five years, and CNNs seem to do significantly better than anything else. You could take a look at the Caffe toolset (or Theano, TensorFlow, or any of a number of others). This might benefit from more training images, but you can try it on the set you've already got.
Regularization. Don't forget to use regularization and to optimize hyper-parameters (often, using cross-validation and grid search).
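With scikit-learn this looks roughly as follows (the data here is a small synthetic stand-in for the 500 x 360 training matrix described in the question, and the C grid is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the real 500 x 360 training matrix.
rng = np.random.default_rng(0)
X = rng.random((80, 360))
y = rng.integers(0, 2, size=80)

# C controls the strength of the SVM's regularization; rather than fixing
# it by hand, search over a grid with cross-validation.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```

On real data you would inspect `grid.best_score_` (mean cross-validated accuracy) before trusting the chosen C.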
Context
StackExchange Computer Science Q#65497, answer score: 3