snippetMinor
Efficient way to implement periodic events
Viewed 0 times
eventsimplementefficientwayperiodic
Problem
I need to implement a system that can schedule and dispatch periodic jobs and needs to support a very high load (millions of jobs per second). Each job in the system has a period that is some multiple of 15 seconds, with the maximum allowed period being 1 day.
The system needs to support adding new jobs and removing / replacing existing jobs. However, these occur relatively infrequently (aside from when the system starts up and has to add the initial batch of jobs), so it is okay to make trade-offs with regards to insertion / deletion time.
Jobs scheduling times should be spread out so as to avoid "spikes" in number jobs dispatched. For example, if we had a million jobs that all had a period of 15 seconds, we wouldn't want to dispatch all of them at the exact same time every 15 seconds, and then wait another 15 seconds not doing anything before doing it again. It would be better to spread them out so around 60-70k jobs are dispatched every second, as long as each job is still scheduled to run 15 seconds apart.
Currently, around 10% (maybe less) of the jobs in the system have a period of 15 seconds (the minimum possible), while the remaining jobs have periods between 30 seconds and 1 day.
Our current implementation uses something like a hashed timer wheel, which has 15,000 buckets representing the number of milliseconds in every 15 second interval. Jobs are added to a random bucket in the wheel, and on every tick of the wheel, we iterate through every job in the bucket, dispatching the ones that are "expired" and incrementing a counter for the remaining jobs. For example, a job with a 60 second period would only be dispatched once every 4 turns of the wheel, even though it is iterated over on every turn of the wheel. The problem here is that on each tick of the wheel, the majority of the jobs in the next bucket are not actually ready to be dispatched, but we have to iterate over them anyways to increment their counter.
Is there a better algorithm / data structu
The system needs to support adding new jobs and removing / replacing existing jobs. However, these occur relatively infrequently (aside from when the system starts up and has to add the initial batch of jobs), so it is okay to make trade-offs with regards to insertion / deletion time.
Jobs scheduling times should be spread out so as to avoid "spikes" in number jobs dispatched. For example, if we had a million jobs that all had a period of 15 seconds, we wouldn't want to dispatch all of them at the exact same time every 15 seconds, and then wait another 15 seconds not doing anything before doing it again. It would be better to spread them out so around 60-70k jobs are dispatched every second, as long as each job is still scheduled to run 15 seconds apart.
Currently, around 10% (maybe less) of the jobs in the system have a period of 15 seconds (the minimum possible), while the remaining jobs have periods between 30 seconds and 1 day.
Our current implementation uses something like a hashed timer wheel, which has 15,000 buckets representing the number of milliseconds in every 15 second interval. Jobs are added to a random bucket in the wheel, and on every tick of the wheel, we iterate through every job in the bucket, dispatching the ones that are "expired" and incrementing a counter for the remaining jobs. For example, a job with a 60 second period would only be dispatched once every 4 turns of the wheel, even though it is iterated over on every turn of the wheel. The problem here is that on each tick of the wheel, the majority of the jobs in the next bucket are not actually ready to be dispatched, but we have to iterate over them anyways to increment their counter.
Is there a better algorithm / data structu
Solution
A standard approach for scheduling is a priority queue of future jobs: each job has a time at which it should be dispatched, and that time is the key for the priority queue.
When a new job arrives, pick a random time within its period, and schedule it to occur at that period. For instance, if a job arrives with a period of 75 seconds, pick a random number $t$ in the range [0,75), then schedule it to be first dispatched $t$ seconds from now. To schedule it, add it to the priority queue with its dispatch time as the key.
Each second, retrieve all jobs scheduled to be dispatched in this second, delete them from the priority queue, and dispatch them. Also schedule each to occur at a time in the future based on its period (thus inserting it back into the priority queue with a later dispatch time).
The randomization should ensure a reasonable balance of jobs, especially when you have a large number of jobs to schedule, and the priority queue should make it efficient to insert, delete, and dispatch jobs. In particular, all of those operations can be done in $O(\log n)$ time, where $n$ is the number of jobs.
You could check whether that will meet your requirements. If not, that should help clarify your specific requirements.
When a new job arrives, pick a random time within its period, and schedule it to occur at that period. For instance, if a job arrives with a period of 75 seconds, pick a random number $t$ in the range [0,75), then schedule it to be first dispatched $t$ seconds from now. To schedule it, add it to the priority queue with its dispatch time as the key.
Each second, retrieve all jobs scheduled to be dispatched in this second, delete them from the priority queue, and dispatch them. Also schedule each to occur at a time in the future based on its period (thus inserting it back into the priority queue with a later dispatch time).
The randomization should ensure a reasonable balance of jobs, especially when you have a large number of jobs to schedule, and the priority queue should make it efficient to insert, delete, and dispatch jobs. In particular, all of those operations can be done in $O(\log n)$ time, where $n$ is the number of jobs.
You could check whether that will meet your requirements. If not, that should help clarify your specific requirements.
Context
StackExchange Computer Science Q#126579, answer score: 2
Revisions (0)
No revisions yet.