HiveBrain v1.2.0
Get Started
← Back to all entries
snippetpythonMinor

Python: Loop through folders, convert XLSX files to CSV removing non ASCII

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
removingnonconvertcsvloopfoldersfilesasciipythonthrough

Problem

As the subject suggests:

  • Loop recursively through folders



  • All files are XLSX



  • Remove all non ASCII characters



  • Save as a CSV



Can it be improved with regards to speed? Would using openpyxl help?

import os
import xlrd
import csv

for subdir, dirs, files in os.walk("C:\Users\Alan\Downloads\Knowledge"):
    for file in files:
        filepath = subdir + os.sep + file

        wb = xlrd.open_workbook(filepath)
        sh = wb.sheet_by_index(0)
        csv_file = open(filepath.replace(".xlsx","")+'_csv.csv','wb')
        wr = csv.writer(csv_file,quoting=csv.QUOTE_ALL)

        for rownum in xrange(sh.nrows):
            wr.writerow([unicode(val).encode('ascii','ignore') for val in sh.row_values(rownum)])

        csv_file.close()

Solution

I'd have a couple of suggestions which you may want to follow:

  • Use openpyxl module to read xlsx documents, and the csv module to write.



An example might look like this:

import openpyxl
import csv

wb = openpyxl.load_workbook('test.xlsx')
sh = wb.get_active_sheet()
with open('test.csv', 'wb') as f:
    c = csv.writer(f)
    for r in sh.rows:
        # here, you can also filter non-ascii characters
        c.writerow([cell.value for cell in r])


  • Don't use os.sep.



From docs:


[...] Note that knowing this is not sufficient to be able to parse or
concatenate path names...

Instead, you could use normpath from os.path to normalize the path.

From the docs:


Normalize a pathname by collapsing redundant separators and up-level
references so that A//B, A/B/, A/./B and A/foo/../B all become A/B.
This string manipulation may change the meaning of a path that
contains symbolic links. On Windows, it converts forward slashes to
backward slashes.

  • Don't use + separator to join certain paths. Instead, just use os.path.join().



As for the speed improvement, you'll need to go through all the subfolders in your parent folder and find all the .xslx files. Perhaps skipping non-xlsx files will bring a bit of speed: str.endswith('.xslx').

Code Snippets

import openpyxl
import csv

wb = openpyxl.load_workbook('test.xlsx')
sh = wb.get_active_sheet()
with open('test.csv', 'wb') as f:
    c = csv.writer(f)
    for r in sh.rows:
        # here, you can also filter non-ascii characters
        c.writerow([cell.value for cell in r])

Context

StackExchange Code Review Q#146380, answer score: 2

Revisions (0)

No revisions yet.