
Use subset of training data as prediction data

Submitted by: @import:stackexchange-cs
Tags: prediction, training, use, data, subset

Problem

At our company, we've started using Amazon Machine Learning to predict the likelihood of a certain segment of our customers cancelling their subscription.

We only have 500 customers in that segment, so we uploaded training data for all our customers who

  • are on a subscription, or
  • were on a subscription but cancelled

Amazon tells us the model has an AUC score of 0.70, which from what I've read is just OK.

I'm new to machine learning, so my question is the following:

Out of the 500 customers that are in the training set, can I load the ones that haven't cancelled into the prediction set to predict the likelihood of them cancelling their subscription in the future? Or can I only use the model for new customers that are not included in the training data?

Solution

The most important thing is to avoid "testing on the training set". If you train a model on a set of 500 users and then evaluate how well it performs by testing it on some of those same users, you're committing the "testing on the training set" sin. The consequence of doing that is that your evaluation results will be misleading: they will overestimate the effectiveness of your model.

There are various standard methods to avoid this problem. One method is to use a holdout set: you set aside, say, 70% of your data for training and the other 30% for testing. You train a model on the first 70% (the training set), and then evaluate how well your model performs on the other 30% (the testing set). This will give you a reasonable, unbiased estimate of the performance of your machine learning model. In practice, a train+test division is often not enough; for various reasons, you'll often want three sets: a training set, a validation set, and a testing set. You can read more about that in standard resources.
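To make the holdout idea concrete, here is a minimal sketch using scikit-learn. This is an assumption on my part (Amazon Machine Learning manages the split for you, but the principle is identical), and the synthetic data is a stand-in for your real customer features and churn labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ~500 customers: X = feature matrix,
# y = 1 if the customer cancelled, 0 if still subscribed.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8], random_state=0)

# Hold out 30% of customers; stratify so both classes
# appear in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score ONLY the held-out customers -- never the rows used for training.
test_scores = model.predict_proba(X_test)[:, 1]
print("holdout AUC:", roc_auc_score(y_test, test_scores))
```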

Another reasonable method is to use cross-validation to estimate the effectiveness of your scheme. Basically, this involves repeatedly dividing up the data, learning a model on one portion of the data, evaluating it on another portion, and then averaging the results. This is a reasonable alternative to dividing the data into train+validate+test sets when you don't have a lot of labelled data -- as in your case.
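Continuing the sketch above (same hypothetical X and y), stratified k-fold cross-validation might look like this in scikit-learn:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 5-fold stratified CV: every customer is scored exactly once, and
# always by a model that did not see that customer during training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(LogisticRegression(max_iter=1000),
                             X, y, cv=cv, scoring="roc_auc")
print("CV AUC: %.3f +/- %.3f" % (auc_scores.mean(), auc_scores.std()))
```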

These comments are all about how to build the model, select parameters for your model, and evaluate the model's performance (predict how effective it will be in practice). Once you've done all that and settled on a model, then you can simply apply the model to new customers to predict their likelihood of cancelling their subscription.
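For completeness, a sketch of that final step, again continuing the hypothetical example above. X_new is a placeholder for the feature rows of the customers you actually want to score; here it is generated synthetically just so the snippet runs:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical "new customer" rows with the same 10 feature columns
# as X above (in reality these would come from your live data).
X_new, _ = make_classification(n_samples=20, n_features=10,
                               weights=[0.8], random_state=1)

# Once model selection and evaluation are finished, refit on ALL
# labelled data, then score the new customers.
final_model = LogisticRegression(max_iter=1000).fit(X, y)
cancel_probability = final_model.predict_proba(X_new)[:, 1]
print(cancel_probability)
```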

Finally, one last comment: your approach assumes uniformity over time: i.e., that the factors that influence behavior in the past will continue to hold in the same way in the future. This may or may not be realistic. For instance, maybe today some people are more likely to cancel cable subscriptions because of the availability of Netflix/Hulu/etc. whereas ten years ago that wasn't as much of a factor. In many areas, things can change over time. Read about concept drift to learn more about that particular challenge.

If this didn't answer your question, edit the question to clarify in more detail what you are asking about.

Context

StackExchange Computer Science Q#49121, answer score: 3
