Fast algorithm for clustering groups of elements given their size/time
Problem
I don't know whether there is a canonical problem that my practical problem reduces to, so I will simply describe it as best I can.
I would like to cluster files into a specified number of groups, where each group's size (the sum of the sizes of its files) is as close as possible to the size of every other group in the same cluster (but not to groups in other clusters). Here are the requirements:
- The first group always contains exactly one file, and this group is the biggest of all groups in the cluster.
- Any group other than the first can contain multiple files.
- The number of groups in each cluster is capped at a maximum specified by the user (there can be fewer groups if that gives a better fit or if there are not enough files).
- There is no constraint on the number of clusters (there can be as few or as many as necessary).
- Goal (objective function): minimize the space left unused in all groups (of all clusters) while maximizing the number of groups per cluster (up to the specified maximum).
The reason behind these requirements is that I am encoding files together, and any remaining space in any group will need to be filled by null bytes, which is a waste of space.
Clarification on the objective and constraints that follow from the requirements and the problem statement:
- Input is a list of files with their respective size.
- Desired output: a list of clusters, each cluster comprising groups of files, each group holding one or several concatenated files.
- There must be at least 2 groups per cluster (unless no files remain) and up to a maximum of G groups (specified by the user).
- Each file can be assigned to any group whatsoever and each group can be assigned to any cluster.
- The number of clusters can be chosen freely.
Here is a diagram showing one wrong and one good clustering of 5 files (1 big file, and 4 files of exactly half the size of the big file) with the number of groups = 2 (diagram omitted).
The solution need not be optimal; a sub-optimal clustering is acceptable.
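To make the objective concrete, here is a small sketch of the 5-file example above (the names and sizes are hypothetical: one file of size 100 and four of size 50). It computes the padding a clustering wastes, since every group in a cluster must be padded with null bytes up to the size of that cluster's biggest group:

```python
# Hypothetical sizes for the 5-file example: 1 big file, 4 half-size files.
sizes = {"big": 100, "s1": 50, "s2": 50, "s3": 50, "s4": 50}

def padding(clusters):
    '''Total null bytes needed: each group is padded up to its cluster's biggest group.'''
    total = 0
    for cluster in clusters:
        group_sizes = [sum(sizes[f] for f in group) for group in cluster]
        total += sum(max(group_sizes) - s for s in group_sizes)
    return total

# Wrong: pairing the big file with a single half-size file wastes 50 bytes.
wrong = [[["big"], ["s1"]], [["s2"], ["s3"]], [["s4"]]]
# Good: two half-size files together balance the big file exactly.
good = [[["big"], ["s1", "s2"]], [["s3"], ["s4"]]]

print(padding(wrong), padding(good))  # wrong wastes 50, good wastes 0
```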
Solution
I finally devised a better algorithm by turning the problem upside down (i.e., using bottom-up construction instead of top-down).
In my original algorithm, I first create a cluster and its groups, and then walk through the whole files list until I have either filled the group sizes completely or no remaining file is small enough to fit.
Here it is the other way around: I walk through the files list and, for each file, either assign it to a group if it fits or, if it is too big, create a new cluster initialized with this file. To do that, I continuously maintain a sorted list of group sizes left to fill, so that when I pick a file to place I only need to check its size against the largest entry of the "to-fill" list.
The algorithm runs in O(n log(g)) (thanks to insertion sort or binary search trees); the pseudocode is reproduced in the docstring of the implementation below.
Since it runs in O(n log(g)), the number of groups has little impact on the running time, contrary to the algorithm in the OP. It may be possible to do better, but for now it is fast enough for me, since it can reasonably process lists of 10M files in under 20 seconds.
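The "to-fill" bookkeeping can be sketched in isolation with the standard library's bisect module (a stand-in for illustration only: insort keeps the list sorted but costs O(g) per insertion because of the shift, which sortedcontainers avoids):

```python
import bisect

slots = []  # sorted list of remaining group capacities (the "to-fill" list)
for capacity in (30, 10, 50):
    bisect.insort(slots, capacity)  # keep capacities in ascending order

# A new file only needs to be compared against the largest remaining capacity.
file_size = 20
assert file_size <= slots[-1]  # it fits, so assign it to the largest slot
largest = slots.pop()          # consume that slot ...
remaining = largest - file_size
if remaining > 0:
    bisect.insort(slots, remaining)  # ... and re-insert the leftover space

print(slots)  # the leftover capacity (30) rejoins the sorted list
```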
For those interested, here's a working implementation in Python (I used the sortedcontainers module for the sorted list, which makes insertion O(log(g)) instead of the O(g) of insertion sort):
```
from random import randint
from sortedcontainers import SortedList  # pip install sortedcontainers

def gen_rand_fileslist(nbfiles=100, maxvalue=100):
    '''Generate a random {filename: size} dict for testing.'''
    fileslist = {}
    for i in range(nbfiles):
        fileslist["file_%i" % i] = randint(1, maxvalue)
    return fileslist

def group_files_by_size_fast(fileslist, nbgroups, mode=1):
    '''Given a files list with sizes, output a list where the files are grouped in nbgroups per cluster.

    For each file:
    - If the to-fill list is empty or file.size > first-key(to-fill):
        * Create cluster c with the file in first group g1
        * Add to-fill[file.size].append([c, g2], [c, g3], ..., [c, gn])
    - Else:
        * ksize = first-key(to-fill)
        * c, g = to-fill[ksize].popitem(0)
        * Add the file to cluster c in group g
        * nsize = ksize - file.size
        * If nsize > 0:
            . to-fill[nsize].append([c, g])
            . Sort to-fill if it is not an automatically ordered structure
    '''
    ftofill = SortedList()   # sorted sizes left to fill
    ftofill_pointer = {}     # size -> list of (cluster, group) slots of that size
    fgrouped = []            # resulting list of clusters
    ford = sorted(fileslist.items(), key=lambda x: x[1])  # files by ascending size
    last_cid = -1
    while ford:
        fname, fsize = ford.pop()  # biggest remaining file
        if not ftofill or fsize > ftofill[-1]:
            # Branch A: no remaining slot is big enough, create a new cluster
            last_cid += 1
            fgrouped.append([])
            fgrouped[last_cid].append([fname])  # first group holds only this file
            if mode == 0:
                # Pre-create the other (empty) groups, each with fsize left to fill
                for g in range(nbgroups - 1, 0, -1):
                    fgrouped[last_cid].append([])
                    if fsize not in ftofill_pointer:
                        ftofill_pointer[fsize] = []
                    ftofill_pointer[fsize].append((last_cid, g))
                    ftofill.add(fsize)
            else:
                # Seed each other group with the next biggest file
                for g in range(1, nbgroups):
                    try:
                        fgname, fgsize = ford.pop()
                    except IndexError:
                        break
                    fgrouped[last_cid].append([fgname])
                    diff_size = fsize - fgsize
                    if diff_size > 0:
                        if diff_size not in ftofill_pointer:
                            ftofill_pointer[diff_size] = []
                        ftofill_pointer[diff_size].append((last_cid, g))
                        ftofill.add(diff_size)
        else:
            # Branch B: assign the file to the biggest slot it fits in
            ksize = ftofill.pop()
            c, g = ftofill_pointer[ksize].pop()
            fgrouped[c][g].append(fname)
            nsize = ksize - fsize
            if nsize > 0:
                if nsize not in ftofill_pointer:
                    ftofill_pointer[nsize] = []
                ftofill_pointer[nsize].append((c, g))
                ftofill.add(nsize)
    return fgrouped

fileslist = gen_rand_fileslist(1000)
fgrouped2 = group_files_by_size_fast(fileslist, nbgroups=4)

def grouped_count_sizes(fileslist, fgrouped):
    '''Compute the total size of each group in each cluster.'''
    # (The original snippet was cut off at this point; this body is a minimal reconstruction.)
    return [[sum(fileslist[fname] for fname in group) for group in cluster]
            for cluster in fgrouped]
```
Context
StackExchange Computer Science Q#44406, answer score: 2