Parallelizing upload
Tags: upload, parallelizing, stackoverflow
Problem
I needed something that could send (scp/rsync) many files in parallel, but without overloading the machine or the firewalls along the way by starting e.g. 600 simultaneous uploads, so the upload should be done in batches.
Most of the utilities like aria2c are download managers, and I needed something to send many files in parallel from a machine behind NAT to a machine on the internet (so downloading from the internet is not possible). Please keep in mind that what I quote below is a draft; I do not need details on general good practices such as setting paths and counters using variables rather than literal values.

Any better approach? Problems? I've noticed that I have to sleep 1 between starting the scp commands in the background, or else the target server refuses some connections.

#!/bin/bash
FLIST=file_paths_list.txt
COUNTER=0
PIDS=()
cat ${FLIST} | while read F; do
    COUNTER=$((COUNTER + 1))
    echo COUNTER $COUNTER
    sleep 1
    scp ${F} root@host:/data/tmp &
    PID=$!
    echo Adding PID $PID
    PIDS+=($PID)
    if [ $COUNTER -lt 20 ]; then
        continue
    fi
    # wait for uploads batch to complete
    for PID in ${PIDS[@]}; do
        echo waiting for PID $PID
        wait $PID
    done
    # reset PIDS array
    PIDS=()
    COUNTER=0
done
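As a point of comparison (not part of the original question), the same batch-of-20 pattern can be written without the PID array: a bare wait blocks until every background job has exited, and a final wait after the loop catches the last partial batch, which the draft above never waits for. A minimal sketch, assuming the same list file and destination:

#!/bin/bash
# batch uploads 20 at a time; a plain 'wait' replaces the PID bookkeeping
N=0
while read -r F; do
    scp "$F" root@host:/data/tmp &
    if (( ++N % 20 == 0 )); then
        wait    # block until the current batch of 20 finishes
    fi
done < file_paths_list.txt
wait            # catch the final partial batch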
Solution

It is very, very unlikely that transferring multiple files in parallel will go any faster than transferring them all sequentially (your bottleneck will be your internet bandwidth regardless).
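(An aside not in the original answer: if rsync is acceptable, a single invocation can do the whole sequential bulk copy over one connection. A sketch, assuming the list contains absolute paths, so the source tree is given as /; note that --files-from implies --relative, so unlike scp the directory structure is recreated under the destination.)

rsync -av --files-from=file_paths_list.txt / root@host:/data/tmp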
Is there something wrong with doing something like:
#!/bin/bash
SCPCMD="scp \"\$@\" root@host:/data/tmp"
# echo The scp command is: $SCPCMD
# the trailing "_" fills $0 inside bash -c, so "$@" receives every file name
cat ${FLIST} | tr '\n' '\000' | xargs --null bash -c "${SCPCMD}" _

The above script will convert newlines to null characters in the input list, then copy the files to the server.
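To see why the null conversion matters (a toy demo, not from the original answer): with --null, a file name containing spaces survives as a single argument instead of being split:

printf 'one file.txt\nanother.txt\n' | tr '\n' '\000' | xargs --null -n 1 echo GOT:
# prints "GOT: one file.txt" then "GOT: another.txt"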
You can add the -n 10 argument to xargs to do 10 files at a time, or whatever works for you. I like the -t argument as well, which echoes the xargs command to stderr before it runs it.

The above will copy only one stream at a time, but it will do the files in bulk operations, and you will likely be limited by your network bandwidth, not the parallelism of your copies.
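For example (an illustrative combination, not spelled out in the original answer), to copy 10 files per scp invocation while tracing each command on stderr:

cat ${FLIST} | tr '\n' '\000' | xargs --null -t -n 10 bash -c "${SCPCMD}" _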
EDIT:

If you want to run the scp's in parallel, you can add the -P xx (--max-procs) argument to xargs, which will run as many as xx scp's in parallel for you. So, for example, if you have hundreds of files, you can scp them in 5 parallel streams, 5 files at a time, with:

cat ${FLIST} | tr '\n' '\000' | xargs --null -P 5 -n 5 bash -c "${SCPCMD}" _
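A further option (my addition, not in the original answer) for the refused-connections problem the question works around with sleep 1: OpenSSH connection multiplexing lets all the parallel scp's share one TCP connection, so per-connection server limits such as MaxStartups are not tripped. A sketch, assuming OpenSSH with ControlMaster support; the socket path is arbitrary:

# multiplexing variant: every scp reuses one persistent master connection
SSHOPTS='-o ControlMaster=auto -o ControlPath=/tmp/ssh-mux-%r@%h-%p -o ControlPersist=10m'
SCPCMD="scp ${SSHOPTS} \"\$@\" root@host:/data/tmp"
cat ${FLIST} | tr '\n' '\000' | xargs --null -P 5 -n 5 bash -c "${SCPCMD}" _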
Code Snippets

#!/bin/bash
SCPCMD="scp \"\$@\" root@host:/data/tmp"
# echo The scp command is: $SCPCMD
cat ${FLIST} | tr '\n' '\000' | xargs --null bash -c "${SCPCMD}" _

cat ${FLIST} | tr '\n' '\000' | xargs --null -P 5 -n 5 bash -c "${SCPCMD}" _

Context
StackExchange Code Review Q#39377, answer score: 4