HiveBrain v1.2.0

Parallelizing upload

Submitted by: @import:stackexchange-codereview
Tags: upload, parallelizing, stackoverflow

Problem

I needed something that could send (scp/rsync) many files in parallel, but without overloading the machine or the firewalls along the way by starting e.g. 600 simultaneous uploads — so the upload should be done in batches.

Most utilities like aria2c are download managers, and I needed something to send many files in parallel from a machine behind NAT to a machine on the internet (so downloading from the internet is not possible).

Please keep in mind that what I quote below is a draft; I don't need feedback on general good practices such as setting paths and counters via variables instead of literal values.

Any better approach? Any problems? I've noticed that I have to sleep 1 between starting scp commands in the background, or else the target server refuses some connections.

#!/bin/bash

FLIST=file_paths_list.txt

COUNTER=0
PIDS=()

cat ${FLIST} | while read F; do
                COUNTER=$((COUNTER + 1))
                echo COUNTER $COUNTER
                sleep 1
                scp ${F} root@host:/data/tmp &
                PID=$!
                echo Adding PID $PID
                PIDS+=($PID)
                if [ $COUNTER -lt 20 ]; then
                        continue
                fi
                # wait for uploads batch to complete
                for PID in ${PIDS[@]}; do
                        echo waiting for PID $PID
                        wait $PID
                done
                # reset PIDS array
                PIDS=()
                COUNTER=0
done
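The batching logic in the draft above can be sketched more compactly: a plain `wait` with no arguments blocks until every background job has finished, so no PID bookkeeping is needed. In this sketch `upload` is a stub standing in for the real scp invocation, and the inline `printf` stands in for the file list, so the logic can be run anywhere:

```shell
#!/bin/bash
# Sketch: start BATCH background jobs, then wait for the whole batch
# before starting the next one. `upload` is a stand-in for scp.
upload() { echo "sent $1"; }

BATCH=3
COUNT=0
while IFS= read -r F; do
        upload "$F" &
        COUNT=$((COUNT + 1))
        if [ "$COUNT" -ge "$BATCH" ]; then
                wait        # block until the whole batch completes
                COUNT=0
        fi
done < <(printf 'file%d\n' 1 2 3 4 5)   # stand-in for file_paths_list.txt
wait                                     # catch the final partial batch
```

Using `done < <(…)` instead of `cat … | while` keeps the loop in the current shell, so `COUNT` survives the loop.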

Solution

It is very, very unlikely that copying multiple files in parallel will go any faster than copying them all sequentially (your bottleneck will be your internet bandwidth regardless).

Is there something wrong with doing something like:

#!/bin/bash

FLIST=file_paths_list.txt
SCPCMD="scp \"\$@\" root@host:/data/tmp"
# echo The scp command is: $SCPCMD

# The trailing _ fills $0 for bash -c, so the first file is not silently dropped.
cat ${FLIST} | tr '\n' '\000' | xargs --null bash -c "${SCPCMD}" _


The above script converts the newlines in the input list to null characters, then copies the files to the server; the null delimiters keep filenames containing spaces or other special characters intact.

You can add the -n 10 argument to xargs to do 10 files at a time, or whatever works for you. I like the -t argument as well, which echoes each command to stderr before running it.
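To watch the null-delimited batching behave without touching a real server, scp can be swapped for a plain echo. The filenames here are made up for the demo; note that the name containing a space survives intact, and the trailing `_` fills $0 for bash -c so no filename is swallowed:

```shell
# Five names, one per line; -n 2 groups them two per invocation.
printf 'a b.txt\nc.txt\nd.txt\ne.txt\nf.txt\n' \
  | tr '\n' '\000' \
  | xargs --null -n 2 bash -c 'echo "batch: $@"' _
# batch: a b.txt c.txt
# batch: d.txt e.txt
# batch: f.txt
```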

The above copies only one stream at a time, but it transfers the files in bulk operations, and you will likely be limited by your network bandwidth rather than by the parallelism of your copies.

EDIT:

If you want to run the scp's in parallel, you can add the -P xx (--max-procs) argument to xargs, which will run up to xx scp processes in parallel for you. So, for example, if you have hundreds of files, you can scp them in 5 parallel streams, 5 files at a time, with:

cat ${FLIST} | tr '\n' '\000' | xargs --null -P 5 -n 5 bash -c "${SCPCMD}" _
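The two knobs can be watched interacting with echo again standing in for scp. Five items with -n 2 means three invocations, and -P 2 lets up to two of them run at once; output order may vary between runs, so no expected output is shown:

```shell
# Five items, -n 2 => three invocations, -P 2 => up to two at once.
printf '%s\n' one two three four five \
  | tr '\n' '\000' \
  | xargs --null -P 2 -n 2 bash -c 'echo "stream: $*"' _
```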


Context

StackExchange Code Review Q#39377, answer score: 4
