Remove non-printable characters from string in Python 3
Problem
My aim is to print bytes string lines in Python 3, but truncate each line to a maximum width so that it fits the current terminal window.
My first attempt was only print(output[:max_width]), but this did not work because Python counts e.g. a tab \t as one character, while the terminal displays it as several columns. The terminal would also interpret carriage returns etc.

Therefore I now have this snippet of code, where line is a bytes string:

    output = line.decode(codec, "replace")
    if max_width:
        output = "".join(c for c in output if c.isprintable())
        print(output[:max_width])
    else:
        print(output)

However, I guess it's pretty slow to rebuild each line this way just to filter out non-printable characters like \t and \r (and whatever characters I might have forgotten).

Please note that codec is specified by the user. It might be "ascii", "utf-8", "utf-16" or any other valid built-in codec.

Can you please suggest how to improve this code's performance?
Solution
Something that may help performance-wise is itertools.islice. It lets you stop calling str.isprintable() as soon as max_width printable characters have been collected, instead of calling it for every character in the line. As this is a binary file that may not contain many \ns, individual lines can be very long, so this can save a lot of effort.

    import itertools

    output = line.decode(codec, "replace")
    if max_width:
        print("".join(itertools.islice((c for c in output if c.isprintable()), max_width)))
    else:
        print(output)

This on its own may not help on files that have a lot of \ns. The bottleneck in these files would most likely be the overhead incurred by print, so it's much faster to build a string once and display it with a single call.
In these cases you would want to use something like:
(Untested code)
    import itertools

    def read_data(path):
        # Open in binary mode so that each line is bytes and can be decoded.
        with open(path, "rb") as f:
            for line in f:
                output = line.decode(codec, "replace")
                if max_width:
                    yield "".join(itertools.islice(
                        (c for c in output if c.isprintable()),
                        max_width))
                else:
                    yield output

    print('\n'.join(read_data(...)))

However, the above is not good on machines with limited memory or with extremely large files.
In these cases you would want to use a buffer and print the buffer when a threshold has been reached.
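A minimal sketch of that buffering idea, under a few assumptions: print_buffered, out and threshold are names invented here, the threshold is counted in characters rather than bytes, and lines is any iterable of strings that have already been decoded and truncated (for example the read_data generator above). It only shows the shape of the approach and has not been tuned or benchmarked.

```python
import sys


def print_buffered(lines, out=sys.stdout, threshold=64 * 1024):
    """Write lines to out in chunks of roughly threshold characters.

    Hypothetical helper: the name and the default threshold are
    assumptions made for this sketch, not part of the original answer.
    """
    buffer = []
    size = 0
    for line in lines:
        buffer.append(line)
        size += len(line) + 1  # +1 for the newline added when writing
        if size >= threshold:
            # One write call per chunk instead of one print per line.
            out.write("\n".join(buffer) + "\n")
            buffer.clear()
            size = 0
    if buffer:
        # Write whatever is left after the last full chunk.
        out.write("\n".join(buffer) + "\n")
```

Memory use stays bounded by the threshold plus one line, while the number of write calls is still far lower than with one print per line.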
After PEP 3138 your method to remove non-printables seems to be the correct way.
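For illustration, a small sketch of what str.isprintable() accepts and rejects, and of how the filter combines with itertools.islice. The helper name printable_prefix is made up for this example.

```python
import itertools


def printable_prefix(text, max_width):
    """Return the first max_width printable characters of text."""
    printable = (c for c in text if c.isprintable())
    return "".join(itertools.islice(printable, max_width))


# Control characters such as tab, carriage return and newline are rejected.
print("\t".isprintable(), "\r".isprintable(), "\n".isprintable())  # False False False

# The ASCII space and ordinary non-ASCII letters are accepted.
print(" ".isprintable(), "é".isprintable())  # True True

# Truncation counts only the characters that survive the filter.
print(printable_prefix("a\tb\rc\n", 2))  # ab
```

Because the filter drops tabs instead of expanding them, the output never exceeds max_width characters, at the cost of losing the original column alignment.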
Context
StackExchange Code Review Q#123448, answer score: 3