How to scale down in a multi-tenant environment?
Problem
Cloud environments in AWS allow for multi-tenancy managed by the users themselves; classic examples are container orchestrators such as ECS or Kubernetes.
Suppose you have two services, one memory-bound and the other CPU-bound, and you put them in a single cluster. Scaling up is then relatively trivial: each time you need more capacity in terms of either CPU or memory, you add more instances, since every EC2 instance brings units of both CPU and memory.
Scaling up based on a single metric can easily be achieved using CloudWatch Alarms.
Scaling down to reduce cost, however, requires taking both the memory and the CPU limits into account and not letting either of the two drop below the required amount.
Unfortunately, CloudWatch Alarms do not allow boolean logic or take multiple metrics into account.
What is a good way to implement scale-down of capacity for an Auto Scaling group?
Solution
Autoscaling is a good case for machine learning
This is a hard problem to solve well.
What you really want is something like a Nest thermostat for your EC2 infrastructure.
There are multiple dimensions of resource demand/limitations (as you mention).
- CPU
- memory
- disk space
- disk IO
- network IO
- concurrency/latency/queue depth
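To make the multi-dimension point concrete, here is a minimal sketch of the simplest hand-coded rule: compute the node count each resource would need on its own, then take the largest. The function names, the resource keys and the 20% headroom are assumptions for illustration, not anything AWS provides.

```python
import math

def nodes_required(demand, per_node, headroom=0.2):
    """Smallest node count that covers every resource dimension.

    demand   -- total demand per resource, e.g. {"cpu": 6.0, "memory_gib": 40.0}
    per_node -- capacity one node adds per resource, same keys as demand
    headroom -- fraction of each node kept free as a safety margin (assumed)
    """
    needed = 0
    for resource, amount in demand.items():
        usable = per_node[resource] * (1.0 - headroom)
        # The scarcest dimension decides the cluster size.
        needed = max(needed, math.ceil(amount / usable))
    return needed

def scale_down_target(current_nodes, demand, per_node, min_nodes=1):
    """Node count to shrink to; never larger than the current size."""
    required = max(min_nodes, nodes_required(demand, per_node))
    return min(current_nodes, required)
```

With a demand of 6 vCPUs and 40 GiB on nodes of 4 vCPUs and 16 GiB, CPU alone would need 2 nodes and memory alone 4, so a cluster of 6 nodes could shrink to 4 but not below.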
There are multiple indicators of demand (in addition to the above).
- concurrent unique visits/sessions
- pages per visit/session
- rate of engagement/interaction feature usage
There are multiple common patterns of changing demand over time.
- daily user demand cycles
- weekly user demand cycles
- monthly ...
- annual ...
- special event/days
- DDoS load
- media/marketing exposure traffic spikes
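As a rough illustration of using the daily and weekly cycles, the sketch below records the peak load seen in each hour-of-week slot so that a scale-down decision can be checked against what that slot has needed before. The helper names are hypothetical, and it deliberately ignores monthly and annual cycles, special events and traffic spikes, which need separate handling.

```python
from collections import defaultdict
from datetime import datetime

def hour_of_week(ts):
    """Map a timestamp to one of 168 slots (Monday 00:00 is slot 0)."""
    return ts.weekday() * 24 + ts.hour

def seasonal_peaks(samples):
    """Peak load observed per hour-of-week slot.

    samples -- iterable of (datetime, load) pairs from your own metrics
    """
    peaks = defaultdict(float)
    for ts, load in samples:
        slot = hour_of_week(ts)
        peaks[slot] = max(peaks[slot], load)
    return dict(peaks)

def expected_load(peaks, ts, default):
    """Historical peak for the slot containing ts, or default if unseen."""
    return peaks.get(hour_of_week(ts), default)
```

Feeding `expected_load` for the next hour or two into whatever capacity rule you use is one way to avoid scaling down right before a predictable ramp-up.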
There are multiple financial decision factors.
- does revenue scale with traffic? How? (How conservative do you need to be?)
- is there a hidden cost to control (transaction costs, limits)?
- what's the cost model of scaling? (things in a pipeline scale together, things in a load-balanced cluster scale independently)
Before long, if you try to hand-optimize on hand-selected features, you're going to have a monster of technical debt that is possibly more complicated than any other logic on your site. Amazon makes more money when you err (with a large margin) on the side of caution, so their tools will probably never get close to what you want.
Instead, choose an architecture/technology stack that can grow/scale, so you don't have to get it exactly right the first time. Then pick a few factors that you think are obvious. Then try to come up with a way to sort multiple representative possibilities in order of preference. Then collect some real-world data covering all those points. If you're lucky, a simple, obvious hand-coded solution will jump out at you from looking at the data. If not, code up something that will give you an approximate model f(x1, x2, x3, x4) --> y * app nodes, using an appropriate algorithm.
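For the last step, one possible shape of such a model is sketched below. It uses a plain k-nearest-neighbours lookup over past observations, which is only my pick for illustration and not necessarily the "appropriate algorithm" for your data. The four features and all function names are hypothetical.

```python
import math

def fit_scaler(rows):
    """Per-feature (min, span) so that features are comparable in distance."""
    columns = list(zip(*rows))
    return [(min(c), (max(c) - min(c)) or 1.0) for c in columns]

def scale(x, scaler):
    """Min-max scale one feature vector."""
    return [(v - lo) / span for v, (lo, span) in zip(x, scaler)]

def predict_nodes(history, x, k=3):
    """Approximate f(x1, x2, x3, x4) -> node count.

    history -- list of (features, nodes_that_were_enough) observations
    x       -- current feature vector, same order as in history
    Returns the largest node count among the k most similar past
    situations, which errs on the side of caution.
    """
    scaler = fit_scaler([features for features, _ in history])
    target = scale(x, scaler)
    ranked = sorted(
        history,
        key=lambda item: math.dist(scale(item[0], scaler), target),
    )
    return max(nodes for _, nodes in ranked[:k])

# Hypothetical features: (cpu cores, memory GiB, sessions, queue depth)
history = [
    ((2.0, 10.0, 100, 5), 2),
    ((2.5, 12.0, 120, 6), 2),
    ((4.0, 20.0, 250, 12), 4),
    ((6.0, 30.0, 400, 20), 5),
    ((6.5, 35.0, 450, 25), 6),
]
print(predict_nodes(history, (6.2, 32.0, 420, 22), k=2))
```

Taking the maximum over the neighbours rather than the average is a deliberate bias towards over-provisioning; how conservative to be depends on the financial factors listed above.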
I bet you didn't think this one was going to be so much fun!
Context
StackExchange DevOps Q#1398, answer score: 2