
Copying files as fast as possible

Submitted by: @import:stackexchange-codereview

Problem

I am running a shell script on machineA that copies files from machineB and machineC to machineA.

If a file is not on machineB, it is guaranteed to be on machineC. So I try machineB first, and if the file is not there, I copy the same file from machineC.

On machineB and machineC there are folders named in YYYYMMDD format inside this folder:

/data/pe_t1_snapshot


I pick whichever folder inside it has the latest date in YYYYMMDD format, and that gives the full path from which I need to copy the files.

For example, if 20140317 is the latest date folder inside /data/pe_t1_snapshot, then the full path is:

/data/pe_t1_snapshot/20140317


This is the path on machineB and machineC from which I need to copy. I need to copy around 400 files to machineA from machineB and machineC, and each file is 3.5 GB in size.

My current shell script works fine using scp, but it takes ~3 hours to copy the 400 files to machineA.

Below is my shell script:

```
#!/bin/bash

readonly PRIMARY=/export/home/david/dist/primary
readonly SECONDARY=/export/home/david/dist/secondary
readonly FILERS_LOCATION=(machineB machineC)
readonly MEMORY_MAPPED_LOCATION=/data/pe_t1_snapshot
PRIMARY_PARTITION=(0 3 5 7 9) # this will have more file numbers around 200
SECONDARY_PARTITION=(1 2 4 6 8) # this will have more file numbers around 200

dir1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
dir2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)

echo $dir1
echo $dir2

if [ "$dir1" = "$dir2" ]
then
# delete all the files first
find "$PRIMARY" -mindepth 1 -delete
for el in "${PRIMARY_PARTITION[@]}
```

Solution

You posted another question on this same topic two weeks later, with a better solution using GNU parallel.
I'll review this one anyway on its own merits,
though the point may be moot.

It's not a good idea to set StrictHostKeyChecking=no when using ssh.
The host key of a server should normally not change.
When it does, and you don't know why,
it might be a man-in-the-middle attack.
With checking disabled, ssh still connects to a host whose key has changed instead of stopping.
If the change is part of a scheduled server update,
you can manually update the ~/.ssh/known_hosts file accordingly.
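
For a planned key change, the update can also be scripted rather than edited by hand. The sketch below is one possible approach and not something the original script does: `refresh_host_key` is a hypothetical helper name, and it assumes the OpenSSH tools `ssh-keyscan` and `ssh-keygen` are installed. It shows the fingerprint of the key the server currently presents and replaces the stored entry only after you confirm it.

```bash
#!/bin/bash

# Hypothetical helper: replace the stored host key for one host.
# Usage: refresh_host_key HOST [KNOWN_HOSTS_FILE]
refresh_host_key() {
    local host=$1
    local known_hosts=${2:-$HOME/.ssh/known_hosts}
    local scanned answer

    scanned=$(mktemp) || return 1

    # Fetch the key(s) the server presents right now.
    ssh-keyscan "$host" > "$scanned" 2>/dev/null
    if [[ ! -s $scanned ]]; then
        echo "no host key received from $host" >&2
        rm -f "$scanned"
        return 1
    fi

    # Show the fingerprint so it can be compared with one obtained
    # from the server administrator through a trusted channel.
    ssh-keygen -l -f "$scanned"
    read -r -p "Replace the stored key for $host? [y/N] " answer
    if [[ $answer != y ]]; then
        rm -f "$scanned"
        return 1
    fi

    # Drop the old entry, then append the confirmed key.
    ssh-keygen -R "$host" -f "$known_hosts" >/dev/null 2>&1
    cat "$scanned" >> "$known_hosts"
    rm -f "$scanned"
}
```

It would be called as `refresh_host_key machineB`. The confirmation step is the important part: blindly appending whatever `ssh-keyscan` returns would be no safer than disabling the check.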

When filtering the output of ls like this:

ls -dt1 path | head -n1


you don't need the -1 flag.
The purpose of that flag is to print one entry per line.
But when the output goes to a pipe instead of a terminal,
ls already prints one entry per line.
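
A quick way to check this locally is the sketch below. The scratch directory and the dated folder names are made up for the demonstration, and `touch -t` only gives each folder a known modification time so that `ls -t` has a well-defined order.

```bash
#!/bin/bash

# Scratch directory with three dated folders (names are illustrative).
snapshots=$(mktemp -d)
mkdir "$snapshots"/20140315 "$snapshots"/20140316 "$snapshots"/20140317

# Give each folder a modification time matching its name.
touch -t 201403150000 "$snapshots"/20140315
touch -t 201403160000 "$snapshots"/20140316
touch -t 201403170000 "$snapshots"/20140317

# In a pipeline ls prints one entry per line with or without -1,
# so both commands select the same newest folder.
with_flag=$(ls -dt1 "$snapshots"/[0-9]* | head -n1)
without_flag=$(ls -dt "$snapshots"/[0-9]* | head -n1)

echo "with -1:    $with_flag"
echo "without -1: $without_flag"
```

Keep in mind that `-t` orders by modification time, not by the date in the folder name. If the two can disagree on your servers, sorting by name may be closer to what the question describes.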

Instead of this:

if [ "$dir1" = "$dir2" ]


This is easier (the left-hand side needs no quoting; keep the right-hand side quoted so it is compared literally and not as a pattern) and more modern:

if [[ $dir1 = "$dir2" ]]
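
Here is a small sketch of how `[[ ... ]]` treats its operands: an unquoted variable is not word-split, and an unquoted right-hand side of `=` is matched as a glob pattern. The variable values are invented for the demonstration.

```bash
#!/bin/bash

# Illustrative values; the space would break an unquoted [ ... ] test.
dir1="/data/pe t1/20140317"
dir2="/data/pe t1/20140317"

# Unquoted $dir1 is not split inside [[ ]].
if [[ $dir1 = "$dir2" ]]; then
    same_dir=yes
else
    same_dir=no
fi

# An unquoted right-hand side is matched as a glob pattern ...
pattern='2014*'
if [[ 20140317 = $pattern ]]; then unquoted=match; else unquoted=no-match; fi

# ... while a quoted one is compared literally.
if [[ 20140317 = "$pattern" ]]; then quoted=match; else quoted=no-match; fi

echo "same_dir=$same_dir unquoted=$unquoted quoted=$quoted"
# prints: same_dir=yes unquoted=match quoted=no-match
```

For directory names made only of digits the difference never shows up, but quoting the right-hand side costs nothing and avoids surprises.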


Instead of cramming so many ssh options on the command line:

scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@machineA:...


it's better to add these options in the ~/.ssh/config file, like this:

Host machineA
    Hostname machineA
    User david
    ControlMaster auto
    ControlPath ~/.ssh/control-%r@%h:%p
    ControlPersist 900


This way the scp command becomes simply:

scp machineA:...
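
With `Host` entries like the one above for each source machine, a copy-with-fallback step could look like the sketch below. `copy_with_fallback` is a hypothetical helper, not part of the original script, and it assumes the user name and the Control* options for every host alias live in ~/.ssh/config.

```bash
#!/bin/bash

# Hypothetical helper: try each host in turn until one copy succeeds.
# Usage: copy_with_fallback REMOTE_FILE LOCAL_DIR HOST...
copy_with_fallback() {
    local remote_file=$1 local_dir=$2
    shift 2
    local host
    for host in "$@"; do
        # User and Control* settings come from ~/.ssh/config,
        # so the command line only names the host alias.
        if scp -q "$host:$remote_file" "$local_dir/"; then
            return 0
        fi
    done
    echo "could not copy $remote_file from any of: $*" >&2
    return 1
}
```

A call such as `copy_with_fallback "$dir1/$name" "$PRIMARY" machineB machineC` would try machineB first and fall back to machineC, where `$name` stands for whatever file name your loop produces.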


I hope this (and even more, my other answer) helps!


Context

StackExchange Code Review Q#48898, answer score: 2
