HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Group a list of `dict` by all keys except one

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
dictgroupallkeysonelistexcept

Problem

I have a function that I use to convert a list of dictionaries into a list of tuples, similar to itertools.groupby() would do in an ideal world. The goal is to make a list of {unique-dict} => [list of values]. I'm having a hard time explaining it, so I hope the example makes it clear.

I couldn't find a thing on google, so I wrote it pretty quickly, and I am wondering if there is a better way to do this (or a standard library even)

  • Naming things is hard... what is a better one?



  • This is a task I performed fairly often when receiving user input



  • The keys in each dict may not be consistent (some may be missing)



  • It is OK to destroy or mangle the original input list & items



  • Validation is done before calling, so 'key_field' is always present



  • It is expected that values can be duplicated in the list.



  • { status: 1 } => [1,2,3,4,1,2] is ok, if the values are actually present twice.



Simple little python function:

def group_by_excluding_key(list_of_dicts, key_field):
    """
    Takes a list of `dict` items and groups by ALL KEYS in the dict EXCEPT the key_field.
    :param list_of_dicts: List of dicts to group
    :param key_field: key field in dict which should be excluded from the grouping
    """

    output = []

    for item in list_of_dicts:
        found = False
        item_key = item.pop(key_field)

        for existing_group, found_keys in output:
            if existing_group.viewitems() == item.viewitems():
                found_keys.append(item_key)
                found = True
                break

        if not found:
            output.append((item, [item_key]))

    return output


Example Input/Output

```
from pprint import pprint

data = [
{'id': 1, 'status': 1, 'product': 1},
{'id': 2, 'status': 1, 'product': 1},
{'id': 7, 'status': 1, 'product': 2},
{'id': 9, 'status': 1, 'product': 2},
{'id': 3, 'status': 1, 'product': 1},
{'id': 4, 'status': 1, 'product': 1},
{'id': 8, 'status': 1, 'p

Solution

Your code is pretty good, there is one thing that I would add,
the for-else keyword, as this gets rid of the found variable.
Which honestly is just noise.
This is as if a for loop runs completely without breaking then the else will run too. But if it breaks then it won't run the else.
This can leave you with:

def group_by_excluding_key(list_of_dicts, key_field):
    """
    Takes a list of `dict` items and groups by ALL KEYS in the dict EXCEPT the key_field.
    :param list_of_dicts: List of dicts to group
    :param key_field: key field in dict which should be excluded from the grouping
    """
    output = []
    for item in list_of_dicts:
        item_key = item.pop(key_field)
        for existing_group, found_keys in output:
            if existing_group.viewitems() == item.viewitems():
                found_keys.append(item_key)
                break
        else:
            output.append((item, [item_key]))
    return output


Other than that your code is good.

But if I were to were to write this, I'd prefer a very small solution.
Lets say dicts are hash able, what you want is a dictionary that has the modified item as the key, and the popped item_key as the value.
This obviously has two down-sides, it's not ordered, and dicts aren't hash able.
Both easily solved with collections.OrderedDict and tuple(dict.items()).
And so can result in:

from collections import OrderedDict

def group_by_excluding_key(list_of_dicts, key_field):
    """
    Takes a list of `dict` items and groups by ALL KEYS in the dict EXCEPT the key_field.
    :param list_of_dicts: List of dicts to group
    :param key_field: key field in dict which should be excluded from the grouping
    """
    output = OrderedDict()
    for item in list_of_dicts:
        key = item.pop(key_field)
        output.setdefault(tuple(item.items()), []).append(key)
    return [(dict(key), value) for key, value in output.items()]


This has the benefit of moving the for loop into the OrderedDict, and possibly getting \$O(1)\$ key lookup, but requires you to change the type of all the keys, twice.

I know you didn't ask for a performance review, but the performance difference between my code and your code can be tested with the following. The comments are my functions run time over yours as a percentage, so if mine took 0.8s and yours 3.3s then it'll be 24%, followed by how long it took my function to run.

from timeit import timeit
from itertools import count

# 240%, 0.1s
c = count(1)
l = [{'status': 1, 'product': i, 'id': next(c)} for i in range(10)]
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key_dict as fn', number=1000))
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key as fn', number=1000))

# 25%, 0.8s
c = count(1)
l = [{'status': 1, 'product': i, 'id': next(c)} for i in range(100)]
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key_dict as fn', number=1000))
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key as fn', number=1000))

# 28%, 0.8s
c = count(1)
l = [{'status': i, 'product': j, 'id': next(c)} for i in range(10) for j in range(10)]
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key_dict as fn', number=1000))
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key as fn', number=1000))

# 0.4%, 0.9s
c = count(1)
l = [{'status': i, 'product': j, 'id': next(c)} for i in range(100) for j in range(100)]
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key_dict as fn', number=10))
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key as fn', number=10))

Code Snippets

def group_by_excluding_key(list_of_dicts, key_field):
    """
    Takes a list of `dict` items and groups by ALL KEYS in the dict EXCEPT the key_field.
    :param list_of_dicts: List of dicts to group
    :param key_field: key field in dict which should be excluded from the grouping
    """
    output = []
    for item in list_of_dicts:
        item_key = item.pop(key_field)
        for existing_group, found_keys in output:
            if existing_group.viewitems() == item.viewitems():
                found_keys.append(item_key)
                break
        else:
            output.append((item, [item_key]))
    return output
from collections import OrderedDict

def group_by_excluding_key(list_of_dicts, key_field):
    """
    Takes a list of `dict` items and groups by ALL KEYS in the dict EXCEPT the key_field.
    :param list_of_dicts: List of dicts to group
    :param key_field: key field in dict which should be excluded from the grouping
    """
    output = OrderedDict()
    for item in list_of_dicts:
        key = item.pop(key_field)
        output.setdefault(tuple(item.items()), []).append(key)
    return [(dict(key), value) for key, value in output.items()]
from timeit import timeit
from itertools import count

# 240%, 0.1s
c = count(1)
l = [{'status': 1, 'product': i, 'id': next(c)} for i in range(10)]
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key_dict as fn', number=1000))
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key as fn', number=1000))

# 25%, 0.8s
c = count(1)
l = [{'status': 1, 'product': i, 'id': next(c)} for i in range(100)]
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key_dict as fn', number=1000))
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key as fn', number=1000))

# 28%, 0.8s
c = count(1)
l = [{'status': i, 'product': j, 'id': next(c)} for i in range(10) for j in range(10)]
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key_dict as fn', number=1000))
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key as fn', number=1000))

# 0.4%, 0.9s
c = count(1)
l = [{'status': i, 'product': j, 'id': next(c)} for i in range(100) for j in range(100)]
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key_dict as fn', number=10))
print(timeit('fn({!r}, "id")'.format(l), 'from __main__ import group_by_excluding_key as fn', number=10))

Context

StackExchange Code Review Q#142134, answer score: 4

Revisions (0)

No revisions yet.