Fast algorithm for clustering groups of elements given their size/time
Problem
I don't know whether there is a canonical problem that my practical problem reduces to, so I will simply describe it as best I can.
I would like to cluster files into a specified number of groups, where each group's size (the sum of the sizes of its files) is as close as possible to the size of every other group in the same cluster (but not to groups in other clusters). Here are the requirements:
- The first group always contains exactly one file, and this group is the biggest of all groups in the cluster.
- Any group other than the first can contain multiple files.
- The number of groups in each cluster is capped at a maximum specified by the user (there can be fewer groups if that gives a better fit or if there are not enough files).
- There is no constraint on the number of clusters (there can be as few or as many as necessary).
- Goal (objective function): minimize the space left unused in all groups (of all clusters) while maximizing the number of groups per cluster (up to the specified maximum).
The reason behind these requirements is that I am encoding files together, and any remaining space in any group will need to be filled by null bytes, which is a waste of space.
Clarification on the objective and constraints that follow from the requirements and the problem statement:
- Input is a list of files with their respective size.
- Desired output: a list of clusters, each cluster comprising groups of files, each group holding one or several concatenated files.
- There must be at least 2 groups per cluster (unless no files remain) and up to a maximum of G groups (specified by the user).
- Each file can be assigned to any group whatsoever and each group can be assigned to any cluster.
- The number of clusters can be chosen freely.
Here is a diagram showing one wrong and one good clustering of 5 files (1 big file, and 4 files of exactly half the size of the big file) with the number of groups = 2 (diagram omitted).
The solution need not be optimal; a sub-optimal clustering is acceptable.
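To make the objective concrete, here is a small sketch of the 5-file example above (the names and sizes are hypothetical: one file of size 100 and four of size 50). It computes the padding a clustering wastes, since every group in a cluster must be padded with null bytes up to the size of that cluster's biggest group:

```python
# Hypothetical sizes for the 5-file example: 1 big file, 4 half-size files.
sizes = {"big": 100, "s1": 50, "s2": 50, "s3": 50, "s4": 50}

def padding(clusters):
    '''Total null bytes needed: each group is padded up to its cluster's biggest group.'''
    total = 0
    for cluster in clusters:
        group_sizes = [sum(sizes[f] for f in group) for group in cluster]
        total += sum(max(group_sizes) - s for s in group_sizes)
    return total

# Wrong: pairing the big file with a single half-size file wastes 50 bytes.
wrong = [[["big"], ["s1"]], [["s2"], ["s3"]], [["s4"]]]
# Good: two half-size files together balance the big file exactly.
good = [[["big"], ["s1", "s2"]], [["s3"], ["s4"]]]

print(padding(wrong), padding(good))  # wrong wastes 50, good wastes 0
```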
Solution
I finally devised a better algorithm by turning the problem upside down (i.e., using bottom-up construction instead of top-down).
In my original algorithm, I first create a cluster and its groups, and then walk through the whole files list until I have either filled the group sizes completely or no remaining file is small enough to fit.
Here it is the other way around: I walk through the files list and, for each file, either assign it to a group if it fits or, if it is too big, create a new cluster initialized with this file. To do that, I continuously maintain a sorted list of group sizes left to fill, so that when I pick a file to place I only need to check its size against the largest entry of the "to-fill" list.
The algorithm runs in O(n log(g)) (thanks to insertion sort or binary search trees); the pseudocode is reproduced in the docstring of the implementation below.
Since it runs in O(n log(g)), the number of groups has little impact on the running time, contrary to the algorithm in the OP. It may be possible to do better, but for now it is fast enough for me, since it can reasonably process lists of 10M files in under 20 seconds.
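The "to-fill" bookkeeping can be sketched in isolation with the standard library's bisect module (a stand-in for illustration only: insort keeps the list sorted but costs O(g) per insertion because of the shift, which sortedcontainers avoids):

```python
import bisect

slots = []  # sorted list of remaining group capacities (the "to-fill" list)
for capacity in (30, 10, 50):
    bisect.insort(slots, capacity)  # keep capacities in ascending order

# A new file only needs to be compared against the largest remaining capacity.
file_size = 20
assert file_size <= slots[-1]  # it fits, so assign it to the largest slot
largest = slots.pop()          # consume that slot ...
remaining = largest - file_size
if remaining > 0:
    bisect.insort(slots, remaining)  # ... and re-insert the leftover space

print(slots)  # the leftover capacity (30) rejoins the sorted list
```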
For those interested, here's a working implementation in Python (I used the sortedcontainers module for the sorted list, which makes insertion O(log(g)) instead of the O(g) of insertion sort):
```
from random import randint
from sortedcontainers import SortedList  # pip install sortedcontainers

def gen_rand_fileslist(nbfiles=100, maxvalue=100):
    '''Generate a random {filename: size} dict for testing.'''
    fileslist = {}
    for i in range(nbfiles):
        fileslist["file_%i" % i] = randint(1, maxvalue)
    return fileslist

def group_files_by_size_fast(fileslist, nbgroups, mode=1):
    '''Given a files list with sizes, output a list where the files are grouped in nbgroups per cluster.

    For each file:
    - If the to-fill list is empty or file.size > first-key(to-fill):
        * Create cluster c with the file in first group g1
        * Add to-fill[file.size].append([c, g2], [c, g3], ..., [c, gn])
    - Else:
        * ksize = first-key(to-fill)
        * c, g = to-fill[ksize].popitem(0)
        * Add the file to cluster c in group g
        * nsize = ksize - file.size
        * If nsize > 0:
            . to-fill[nsize].append([c, g])
            . Sort to-fill if it is not an automatically ordered structure
    '''
    ftofill = SortedList()   # sorted sizes left to fill
    ftofill_pointer = {}     # size -> list of (cluster, group) slots of that size
    fgrouped = []            # resulting list of clusters
    ford = sorted(fileslist.items(), key=lambda x: x[1])  # files by ascending size
    last_cid = -1
    while ford:
        fname, fsize = ford.pop()  # biggest remaining file
        if not ftofill or fsize > ftofill[-1]:
            # Branch A: no remaining slot is big enough, create a new cluster
            last_cid += 1
            fgrouped.append([])
            fgrouped[last_cid].append([fname])  # first group holds only this file
            if mode == 0:
                # Pre-create the other (empty) groups, each with fsize left to fill
                for g in range(nbgroups - 1, 0, -1):
                    fgrouped[last_cid].append([])
                    if fsize not in ftofill_pointer:
                        ftofill_pointer[fsize] = []
                    ftofill_pointer[fsize].append((last_cid, g))
                    ftofill.add(fsize)
            else:
                # Seed each other group with the next biggest file
                for g in range(1, nbgroups):
                    try:
                        fgname, fgsize = ford.pop()
                    except IndexError:
                        break
                    fgrouped[last_cid].append([fgname])
                    diff_size = fsize - fgsize
                    if diff_size > 0:
                        if diff_size not in ftofill_pointer:
                            ftofill_pointer[diff_size] = []
                        ftofill_pointer[diff_size].append((last_cid, g))
                        ftofill.add(diff_size)
        else:
            # Branch B: assign the file to the biggest slot it fits in
            ksize = ftofill.pop()
            c, g = ftofill_pointer[ksize].pop()
            fgrouped[c][g].append(fname)
            nsize = ksize - fsize
            if nsize > 0:
                if nsize not in ftofill_pointer:
                    ftofill_pointer[nsize] = []
                ftofill_pointer[nsize].append((c, g))
                ftofill.add(nsize)
    return fgrouped

fileslist = gen_rand_fileslist(1000)
fgrouped2 = group_files_by_size_fast(fileslist, nbgroups=4)

def grouped_count_sizes(fileslist, fgrouped):
    '''Compute the total size of each group in each cluster.'''
    # (The original snippet was cut off at this point; this body is a minimal reconstruction.)
    return [[sum(fileslist[fname] for fname in group) for group in cluster]
            for cluster in fgrouped]
```
Context
StackExchange Computer Science Q#44406, answer score: 2