Check that files don't contain Unicode
Problem
I use this code to check that all files in a directory are free from Unicode and non-printable characters.
Can I improve the structure of the code?
#!/usr/bin/env python3
import argparse
import os
import string
import sys


def main(input_folder):
    for directory, subdirs, files in os.walk(input_folder):
        for basename in files:
            path = os.path.join(directory, basename)
            try:
                check_file(path)
            except (ValueError, UnicodeDecodeError) as e:
                print(e)
                sys.exit(1)
    print('All files are ok!')


def check_file(path):
    with open(path, encoding='utf-8') as fp:
        try:
            data = fp.read()
        except UnicodeDecodeError:
            raise ValueError('Warning! {} contains non ascii characters'.format(path))
    if not is_printable(data):
        raise ValueError('Warning! {} contains non printable characters'.format(path))


def is_printable(s):
    return all(c in string.printable for c in s)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Verify correctness of files.')
    parser.add_argument('--input', required=True, help='Input folder')
    args = parser.parse_args()
    main(args.input)
Solution
Your problem specification is unclear, and the code may be wrong as a result. What exactly do you mean by "Unicode characters"? What if a file contains just two bytes, 0x20 and 0x5D? Is that two ASCII characters (a Space and a Right Square Bracket ]), or is that one UTF-16 character (U+205D Tricolon ⁝, when the bytes are read as big-endian)?
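The ambiguity can be seen directly by decoding the same two bytes under both interpretations. This is only a small illustration; the variable names are made up for the example, and big-endian byte order is assumed for the UTF-16 reading.

```python
# The same two bytes, interpreted under two different encodings.
data = bytes([0x20, 0x5D])

as_ascii = data.decode('ascii')      # two characters: a space and ']'
as_utf16 = data.decode('utf-16-be')  # one character: U+205D (big-endian assumed)

print(repr(as_ascii), len(as_ascii))      # ' ]' 2
print(hex(ord(as_utf16)), len(as_utf16))  # 0x205d 1
```

Nothing in the bytes themselves says which reading is intended, so the check has to pick an encoding up front.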
Unless you have a clear definition of what you mean by "Unicode", and have a special reason for detecting non-ASCII characters specifically, you may be better off opening the file with a latin_1 encoding and just checking is_printable(). Decoding as latin_1 never fails, because every byte maps to the code point of the same value. In UTF-8, every byte of a multibyte sequence has its high bit set, so under latin_1 each of those bytes decodes to a character in the U+0080 to U+00FF range, which is outside string.printable. You would reject the same files as before, just with a less specific reason given.
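A minimal sketch of that approach might look like the following. The function name file_is_printable, the constants PRINTABLE and BLOCK_SIZE, and the choice of newline='' are assumptions made for illustration, not part of the original code. The sketch also folds in the two suggestions from the following paragraph (reading in blocks and using a precomputed set).

```python
import string

# Built once, so each membership test is a hash lookup rather than a string scan.
PRINTABLE = frozenset(string.printable)
BLOCK_SIZE = 8192  # characters per read; an arbitrary value for illustration


def file_is_printable(path):
    """Return True if every character in the file is in string.printable."""
    # latin_1 maps each byte 0x00-0xFF to the code point of the same value,
    # so decoding cannot raise UnicodeDecodeError. newline='' disables
    # newline translation so '\r' is checked as it appears in the file.
    with open(path, encoding='latin_1', newline='') as fp:
        while True:
            block = fp.read(BLOCK_SIZE)
            if not block:
                return True  # reached end of file without a bad character
            if not PRINTABLE.issuperset(block):
                return False
```

With this helper, check_file() would reduce to a single test and a single error message, at the cost of no longer distinguishing "not decodable" from "not printable".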
For scalability, it would be better to read fixed-size blocks (perhaps around 8 kB) instead of the entire file at once. It might also be more efficient to test membership against a set built once from string.printable, rather than evaluating c in string.printable, which scans the string linearly for every character.
Context
StackExchange Code Review Q#129394, answer score: 2