HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Simple string compression in Python

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
compressionsimplestringpython

Problem

This is my solution to exercise 1.6 in Cracking the Coding Interview. I am interested in receiving feedback on the coding style and time/space complexity.

The exercise statement is:


Implement a method to perform basic string compression using the counts of repeated characters. For example, the string aabcccccaaa would become a2b1c5a3. If the compressed string would not become smaller than the original string your method should return the original string. You can assume the string has only uppercase and lowercase letters (a-z).

```
import unittest
from collections import defaultdict
from math import log10, floor

def compress_string(data: str) -> str:
"""A function to perform basic string compression using the counts of repeated characters
If the compressed string is not smaller than the original string the function returns
the original string. The assumption is that the string has only uppercase and lowercase letters (a-z)."""
curr_char_pos = 0
frequencies = defaultdict(int)
compressed_string = []
compressed_string_size = 0

for idx in range(len(data)):
if compressed_string_size >= len(data):
break
if data[idx] == data[curr_char_pos]:
frequencies[curr_char_pos] += 1
else:
compressed_string.append(data[curr_char_pos])
compressed_string.append(frequencies[curr_char_pos])
compressed_string_size += floor(log10(frequencies[curr_char_pos])) + 2
curr_char_pos = idx
frequencies[curr_char_pos] = 1

compressed_string.append(data[curr_char_pos])
compressed_string.append(frequencies[curr_char_pos])
compressed_string_size += floor(log10(frequencies[curr_char_pos])) + 2

if compressed_string_size < len(data):
compressed_data = ''.join(str(char) for char in compressed_string)
return compressed_data
else:
return data

class MyTest(unittest.TestCase):

Solution

This is called Run-length-encoding and there are many implementations of this on the Internet, for example: http://rosettacode.org/wiki/Run-length_encoding#Python

I have written a solution myself just now but posting it would not really help (it would be like adding another drop in the ocean) so I will comment your code thoroughly instead:

optimization?

This was probably added as an optimization:

if compressed_string_size >= len(data):
        break


But except for inputs with no consecutive runs whatsoever, the compressed size will exceed the real size only near the end of the compression. Does the overhead of the additional conditional check each loop iteration take less time than the iterations saved? If it does, does it save enough time to justify adding complexity? I suggest you first worry about writing a working program and then apply optimizations with profiling (execution timing).

Additional variable not needed

compressed_string_size = 0


This was probably added as an optimization too, but is it worth it? No, because len runs in O(1) time complexity that is, len does not need to scan the whole list, because lists remember their length, so calling len is almost instantaneous, in fact using that variable is likely a pessimization as log is a very expensive mathematical operation.

Code Snippets

if compressed_string_size >= len(data):
        break
compressed_string_size = 0

Context

StackExchange Code Review Q#141656, answer score: 2

Revisions (0)

No revisions yet.