Remove non-printable characters from string in Python 3

Submitted by: @import:stackexchange-codereview

Problem

My aim is to print the lines of a bytes string in Python 3, truncating each line to a maximum width so that it fits the current terminal window.
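The question does not say where max_width comes from. A minimal sketch of one way to obtain it, assuming the standard library's shutil.get_terminal_size() is acceptable; the helper name current_width and the default of 80 columns are hypothetical:

```python
import shutil


def current_width(default=80):
    """Return the terminal width in columns, or `default` if it cannot be detected."""
    # The fallback pair (columns, lines) is used when the size cannot be
    # queried, for example when output is redirected to a file.
    return shutil.get_terminal_size(fallback=(default, 24)).columns


max_width = current_width()
print(max_width)
```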

My first attempt was simply print(output[:max_width]), but this did not work: Python counts a tab (\t) as one character, while the terminal displays it as several columns. The terminal would also act on carriage returns and similar control characters.
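The mismatch between character count and displayed width can be illustrated with str.expandtabs(). This is only a sketch that assumes the terminal uses tab stops every 8 columns, which is common but not guaranteed:

```python
line = "a\tb"

# Python sees three characters: 'a', a tab, and 'b'.
char_count = len(line)

# With tab stops every 8 columns (an assumption), the tab pads the text
# out to column 8, so the same line occupies 9 columns on screen.
displayed_columns = len(line.expandtabs(8))

print(char_count, displayed_columns)
```

Slicing with output[:max_width] limits the first number, not the second, which is why the truncated line can still overflow the window.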

Therefore I now have this snippet of code, where line is a bytes string:

output = line.decode(codec, "replace")
if max_width:
    output = "".join(c for c in output if c.isprintable())
    print(output[:max_width])
else:
    print(output)


However, I guess it's pretty slow to rebuild each line this way just to filter out non-printable characters like \t and \r (and whatever other characters I might have forgotten).

Please note that codec is specified by the user. It might be "ascii", "utf-8", "utf-16" or any other valid built-in codec.
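Because the codec is chosen by the user, it may not match the data. The "replace" error handler used in the snippet then substitutes U+FFFD for bytes that cannot be decoded instead of raising an exception. A small illustration, with sample bytes made up for this example:

```python
raw = b"caf\xc3\xa9\n"  # "café" encoded as UTF-8, followed by a newline

# Correct codec: the two bytes 0xC3 0xA9 decode to a single character.
as_utf8 = raw.decode("utf-8", "replace")

# Wrong codec: each byte outside the ASCII range becomes U+FFFD.
as_ascii = raw.decode("ascii", "replace")

print(repr(as_utf8))
print(repr(as_ascii))
```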

Can you suggest how to improve this code's performance?

Solution

Something that may help performance-wise is itertools.islice.
It stops the filtering as soon as max_width printable characters have been collected, so str.isprintable() is not called on the rest of the line.
As this is a binary file that may not have many \ns, lines can be very long and this can save a lot of effort.

import itertools

output = line.decode(codec, "replace")
if max_width:
    print("".join(itertools.islice((c for c in output if c.isprintable()), max_width)))
else:
    print(output)
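To see that islice really stops early, the same filter can be run over an iterator and the unconsumed remainder inspected afterwards. This is a sketch; the helper name truncate_printable and the sample text are made up for illustration:

```python
import itertools


def truncate_printable(text, max_width):
    """Return at most max_width printable characters from text."""
    printable = (c for c in text if c.isprintable())
    return "".join(itertools.islice(printable, max_width))


# Feed an iterator instead of a string so we can see what was left untouched.
chars = iter("ab\tcdefghij")
result = truncate_printable(chars, 3)

# Everything after the third printable character was never examined.
remaining = "".join(chars)

print(repr(result), repr(remaining))
```

The tab is skipped, three printable characters are kept, and the rest of the input is never passed to isprintable().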


This on its own may not help on files that have a lot of \ns.
The bottleneck in these files would most likely be the overhead incurred by print,
so it's much faster to build a single string and display it once.
In these cases you would want to use something like:

(Untested code)

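import itertools

def read_data(path, codec, max_width):
    # Open in binary mode so that each line is bytes and can be decoded.
    with open(path, "rb") as f:
        for line in f:
            output = line.decode(codec, "replace")
            if max_width:
                yield "".join(itertools.islice(
                    (c for c in output if c.isprintable()),
                    max_width))
            else:
                # Drop the line ending; the join below adds it back.
                yield output.rstrip("\r\n")

print('\n'.join(read_data(...)))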


However, the above is not good on machines with limited memory or with extremely large files, because it holds the whole output in memory before printing.
In these cases you would want to use a buffer and print the buffer whenever a threshold has been reached.
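A minimal sketch of such a buffer follows. It is not part of the original answer: the function name print_buffered, the 64 KiB default threshold and the out parameter are all assumptions chosen for illustration.

```python
import io
import sys


def print_buffered(lines, threshold=64 * 1024, out=None):
    """Write lines in batches of roughly `threshold` characters."""
    if out is None:
        out = sys.stdout
    buffer = []
    size = 0
    for line in lines:
        buffer.append(line)
        size += len(line) + 1  # +1 for the newline added on output
        if size >= threshold:
            # Threshold reached: flush the batch with a single write.
            out.write("\n".join(buffer) + "\n")
            buffer.clear()
            size = 0
    if buffer:
        # Flush whatever is left after the last full batch.
        out.write("\n".join(buffer) + "\n")


# Usage with an in-memory stream and a tiny threshold to force two flushes.
sink = io.StringIO()
print_buffered(["a", "bb", "ccc"], threshold=4, out=sink)
print(repr(sink.getvalue()))
```

Memory use is then bounded by the threshold plus one line rather than by the size of the file, at the cost of a few more writes.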

Going by PEP 3138, which introduced str.isprintable(), your method of removing non-printable characters seems to be the correct way.
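For reference, here is what str.isprintable() reports for a few characters relevant to the question. The sample characters are chosen for illustration only:

```python
samples = {
    "letter": "a",
    "space": " ",
    "tab": "\t",
    "carriage return": "\r",
    "accented letter": "\u00e9",
    "no-break space": "\u00a0",
}

# Control characters and separators other than the ASCII space are
# reported as non-printable; letters, including non-ASCII ones, are printable.
results = {name: ch.isprintable() for name, ch in samples.items()}
for name, printable in results.items():
    print(f"{name}: {printable}")

# The filter from the question applied to a tab-separated line.
cleaned = "".join(c for c in "col1\tcol2\r\n" if c.isprintable())
print(repr(cleaned))
```

Note that the filter removes tabs outright, so tab-separated fields end up joined together in the output.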

Context

StackExchange Code Review Q#123448, answer score: 3
