snippetpythonMinor
Python: Loop through folders, convert XLSX files to CSV removing non ASCII
Viewed 0 times
removingnonconvertcsvloopfoldersfilesasciipythonthrough
Problem
As the subject suggests:
Can it be improved with regards to speed? Would using
- Loop recursively through folders
- All files are XLSX
- Remove all non ASCII characters
- Save as a CSV
Can it be improved with regards to speed? Would using
openpyxl help?import os
import xlrd
import csv
for subdir, dirs, files in os.walk("C:\Users\Alan\Downloads\Knowledge"):
for file in files:
filepath = subdir + os.sep + file
wb = xlrd.open_workbook(filepath)
sh = wb.sheet_by_index(0)
csv_file = open(filepath.replace(".xlsx","")+'_csv.csv','wb')
wr = csv.writer(csv_file,quoting=csv.QUOTE_ALL)
for rownum in xrange(sh.nrows):
wr.writerow([unicode(val).encode('ascii','ignore') for val in sh.row_values(rownum)])
csv_file.close()Solution
I'd have a couple of suggestions which you may want to follow:
An example might look like this:
From docs:
[...] Note that knowing this is not sufficient to be able to parse or
concatenate path names...
Instead, you could use
From the docs:
Normalize a pathname by collapsing redundant separators and up-level
references so that
This string manipulation may change the meaning of a path that
contains symbolic links. On Windows, it converts forward slashes to
backward slashes.
As for the speed improvement, you'll need to go through all the subfolders in your parent folder and find all the .xslx files. Perhaps skipping non-xlsx files will bring a bit of speed:
- Use
openpyxlmodule to read xlsx documents, and thecsvmodule to write.
An example might look like this:
import openpyxl
import csv
wb = openpyxl.load_workbook('test.xlsx')
sh = wb.get_active_sheet()
with open('test.csv', 'wb') as f:
c = csv.writer(f)
for r in sh.rows:
# here, you can also filter non-ascii characters
c.writerow([cell.value for cell in r])- Don't use
os.sep.
From docs:
[...] Note that knowing this is not sufficient to be able to parse or
concatenate path names...
Instead, you could use
normpath from os.path to normalize the path.From the docs:
Normalize a pathname by collapsing redundant separators and up-level
references so that
A//B, A/B/, A/./B and A/foo/../B all become A/B.This string manipulation may change the meaning of a path that
contains symbolic links. On Windows, it converts forward slashes to
backward slashes.
- Don't use
+separator to join certain paths. Instead, just useos.path.join().
As for the speed improvement, you'll need to go through all the subfolders in your parent folder and find all the .xslx files. Perhaps skipping non-xlsx files will bring a bit of speed:
str.endswith('.xslx').Code Snippets
import openpyxl
import csv
wb = openpyxl.load_workbook('test.xlsx')
sh = wb.get_active_sheet()
with open('test.csv', 'wb') as f:
c = csv.writer(f)
for r in sh.rows:
# here, you can also filter non-ascii characters
c.writerow([cell.value for cell in r])Context
StackExchange Code Review Q#146380, answer score: 2
Revisions (0)
No revisions yet.