
Finding all non-empty directories and their files on an SFTP server with Paramiko

Submitted by: @import:stackexchange-codereview

Problem

The purpose of the following function is to find all non-empty directories and the files in them. It recursively checks each directory on an SFTP server to see whether it contains any files, and if it does, adds them to a defaultdict using the directory path as the key. The function uses paramiko.SFTPClient and stat. I am specifically concerned about the performance; it is rather slow.

Prerequisite information

  • sftp.listdir_attr returns a list of SFTPAttributes objects, each representing a file, directory, symlink, etc. Each one carries an st_mode, which is used to determine whether the entry is a directory or a file. The call can raise an IOError, for example if you don't have permission to inspect the path.



  • stat.S_ISDIR inspects the mode to determine whether it's a directory
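
As a small illustration of the two points above, the sketch below classifies entries by their st_mode using only the standard stat module. The mode values are built by hand rather than fetched from a server, and classify_mode is a hypothetical helper name, not part of paramiko.

```python
import stat

def classify_mode(st_mode):
    """Return a coarse label for a POSIX mode value (hypothetical helper)."""
    if stat.S_ISDIR(st_mode):
        return "directory"
    if stat.S_ISLNK(st_mode):
        return "symlink"
    if stat.S_ISREG(st_mode):
        return "regular file"
    return "other"

# Hand-built modes standing in for SFTPAttributes.st_mode values
dir_mode = stat.S_IFDIR | 0o755
file_mode = stat.S_IFREG | 0o644
link_mode = stat.S_IFLNK | 0o777

print(classify_mode(dir_mode))   # directory
print(classify_mode(file_mode))  # regular file
print(classify_mode(link_mode))  # symlink
```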



The function in question:

import os
import stat
from collections import defaultdict

def recursive_ftp(sftp, path='.', files=None):
    if files is None:
        files = defaultdict(list)

    # loop over list of SFTPAttributes (files with modes)
    for attr in sftp.listdir_attr(path):

        if stat.S_ISDIR(attr.st_mode):
            # if the entry is a directory, recurse into it
            recursive_ftp(sftp, os.path.join(path, attr.filename), files)

        else:
            # if the entry is not a directory, add it to our dict
            files[path].append(attr.filename)

    return files


Use:

import paramiko
import stat

# host, port, username and password are placeholders for your server's details
transport = paramiko.Transport((host, port))
transport.connect(username=username, password=password)
sftp = paramiko.SFTPClient.from_transport(transport)

files = recursive_ftp(sftp)


If we have an SFTP server that looks like this:

/foo
----a.csv
----b.csv
/bar
----c.csv
/baz


The function will return a dictionary like so:

{
    './foo': ['a.csv', 'b.csv'],
    './bar': ['c.csv']
}
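
To check this expected result without a live server, the function can be run against an in-memory stand-in. The sketch below rests on assumptions: FakeSFTP and FakeAttr are hypothetical names that mimic only listdir_attr and the two attributes the function reads, and posixpath.join is swapped in for os.path.join so the keys use forward slashes on any client platform.

```python
import posixpath
import stat
from collections import defaultdict, namedtuple

# Hypothetical stand-in for paramiko.SFTPAttributes (only the fields used here)
FakeAttr = namedtuple("FakeAttr", ["filename", "st_mode"])

class FakeSFTP:
    """Hypothetical in-memory client exposing only listdir_attr."""

    def __init__(self, tree):
        # tree maps a directory path to a list of (name, is_dir) pairs
        self.tree = tree

    def listdir_attr(self, path="."):
        return [
            FakeAttr(name, (stat.S_IFDIR if is_dir else stat.S_IFREG) | 0o644)
            for name, is_dir in self.tree[path]
        ]

def recursive_ftp(sftp, path=".", files=None):
    if files is None:
        files = defaultdict(list)
    for attr in sftp.listdir_attr(path):
        if stat.S_ISDIR(attr.st_mode):
            recursive_ftp(sftp, posixpath.join(path, attr.filename), files)
        else:
            files[path].append(attr.filename)
    return files

# Same layout as the example server above
tree = {
    ".": [("foo", True), ("bar", True), ("baz", True)],
    "./foo": [("a.csv", False), ("b.csv", False)],
    "./bar": [("c.csv", False)],
    "./baz": [],
}
result = dict(recursive_ftp(FakeSFTP(tree)))
print(result)
```

The empty directory baz never appears as a key, because a defaultdict entry is only created when a file is appended to it.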

Solution

Nothing in your implementation obviously explains the slowness. The slowest part is most likely the listdir_attr call, which has to go over the network for every directory; you might want to check by other means whether its speed matches what your network has to offer.
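
One way to check this is to time the listing call on its own, separately from the recursion. The sketch below is a generic timing helper, with time_call a hypothetical name; it accepts any callable, so sftp.listdir_attr can be passed in once a connection exists.

```python
import time

def time_call(func, *args, repeats=3):
    """Call func(*args) `repeats` times; return (last result, best elapsed seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

# With a live connection this would be (sftp is assumed to exist):
#     entries, seconds = time_call(sftp.listdir_attr, ".")
# Here a local stand-in is timed so the snippet runs anywhere.
entries, seconds = time_call(sorted, [3, 1, 2])
print(entries, f"{seconds:.6f}s")
```

Comparing the figure with how long another SFTP client takes to list the same directory should indicate whether the cost lies in the client code or in the network.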

That being said, there are a few changes you can make to improve things on your end:

  • use a helper function so files will not be both a return value and modified in place;



  • use paramiko's emulation of a working directory (chdir and getcwd) to remove the need for os.path;



  • use a list comprehension to remove the need for defaultdict.



I'm also wondering whether you really want to list everything that is not a directory, or only regular files (i.e. no symlinks, no block devices, etc.). You can change the proposed list comprehension accordingly.
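
The two choices differ only in the predicate of the comprehension. A minimal sketch, using hand-built entries (the Entry tuple is a hypothetical stand-in for SFTPAttributes):

```python
import stat
from collections import namedtuple

# Hypothetical stand-in for paramiko.SFTPAttributes
Entry = namedtuple("Entry", ["filename", "st_mode"])

stats = [
    Entry("data.csv", stat.S_IFREG | 0o644),
    Entry("latest", stat.S_IFLNK | 0o777),
    Entry("archive", stat.S_IFDIR | 0o755),
]

# Everything that is not a directory (what the original function collects)
non_dirs = [e.filename for e in stats if not stat.S_ISDIR(e.st_mode)]

# Regular files only (no symlinks, devices, sockets, ...)
regular = [e.filename for e in stats if stat.S_ISREG(e.st_mode)]

print(non_dirs)  # ['data.csv', 'latest']
print(regular)   # ['data.csv']
```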

Proposed improvements

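import stat

def _sftp_helper(sftp, files):
    stats = sftp.listdir_attr('.')
    # Keep regular files only, and record the directory only if it has some
    regular = [attr.filename for attr in stats if stat.S_ISREG(attr.st_mode)]
    if regular:
        files[sftp.getcwd()] = regular

    for attr in stats:
        if stat.S_ISDIR(attr.st_mode):  # If the entry is a directory, recurse into it
            sftp.chdir(attr.filename)
            _sftp_helper(sftp, files)
            sftp.chdir('..')

def filelist_recursive(sftp):
    files = {}
    # getcwd() returns None until chdir() has been called at least once
    sftp.chdir('.')
    _sftp_helper(sftp, files)
    return files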


You can easily adapt this to bring back the optional path parameter in filelist_recursive.
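
One possible adaptation is sketched below; it is an illustration, not the answer's code. It assumes that chdir accepts the starting path and that getcwd returns None until chdir has been called, as paramiko.SFTPClient documents. FakeSFTP and FakeAttr are hypothetical in-memory stand-ins so the sketch runs without a server, and only directories holding at least one regular file are recorded.

```python
import posixpath
import stat
from collections import namedtuple

FakeAttr = namedtuple("FakeAttr", ["filename", "st_mode"])

class FakeSFTP:
    """Hypothetical in-memory client mimicking chdir, getcwd and listdir_attr."""

    def __init__(self, tree):
        self.tree = tree   # absolute directory path -> list of (name, is_dir)
        self.cwd = None    # no working directory until chdir is called

    def _resolve(self, path):
        return posixpath.normpath(posixpath.join(self.cwd or "/", path))

    def chdir(self, path):
        target = self._resolve(path)
        if target not in self.tree:
            raise IOError("not a directory: " + target)
        self.cwd = target

    def getcwd(self):
        return self.cwd

    def listdir_attr(self, path="."):
        return [
            FakeAttr(name, (stat.S_IFDIR if is_dir else stat.S_IFREG) | 0o644)
            for name, is_dir in self.tree[self._resolve(path)]
        ]

def _sftp_helper(sftp, files):
    stats = sftp.listdir_attr(".")
    regular = [a.filename for a in stats if stat.S_ISREG(a.st_mode)]
    if regular:
        files[sftp.getcwd()] = regular
    for attr in stats:
        if stat.S_ISDIR(attr.st_mode):
            sftp.chdir(attr.filename)
            _sftp_helper(sftp, files)
            sftp.chdir("..")

def filelist_recursive(sftp, path="."):
    files = {}
    previous = sftp.getcwd()
    sftp.chdir(path)  # also guarantees getcwd() is no longer None
    try:
        _sftp_helper(sftp, files)
    finally:
        if previous is not None:
            sftp.chdir(previous)  # restore the caller's working directory
    return files

tree = {
    "/": [("foo", True), ("bar", True), ("baz", True)],
    "/foo": [("a.csv", False), ("b.csv", False)],
    "/bar": [("c.csv", False)],
    "/baz": [],
}
whole_tree = filelist_recursive(FakeSFTP(tree))
only_foo = filelist_recursive(FakeSFTP(tree), "/foo")
print(whole_tree)
print(only_foo)
```

Note that the keys are the paths reported by getcwd, so they are absolute rather than relative to the starting directory as in the original function.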

Context

StackExchange Code Review Q#127180, answer score: 5
