Copying files as fast as possible
Problem
I am running my shell script on `machineA`, which copies files from `machineB` and `machineC` to `machineA`. If a file is not on `machineB`, then it is sure to be on `machineC`. So I try to copy from `machineB` first, and if the file is not there, I go to `machineC` to copy the same file.

On `machineB` and `machineC` there will be a folder named in `YYYYMMDD` format inside this folder: `/data/pe_t1_snapshot`

Whichever `YYYYMMDD` folder inside the above folder has the latest date is the one I pick as the full path to copy from. For example, if `20140317` is the latest date folder inside `/data/pe_t1_snapshot`, then the full path is `/data/pe_t1_snapshot/20140317`, and that is where I need to start copying the files from on `machineB` and `machineC`.

I need to copy around 400 files to `machineA` from `machineB` and `machineC`, and each file is 3.5 GB. My current shell script below works fine using `scp`, but it takes ~3 hours to copy the 400 files to `machineA`.

Below is my shell script:
```
#!/bin/bash
readonly PRIMARY=/export/home/david/dist/primary
readonly SECONDARY=/export/home/david/dist/secondary
readonly FILERS_LOCATION=(machineB machineC)
readonly MEMORY_MAPPED_LOCATION=/data/pe_t1_snapshot
PRIMARY_PARTITION=(0 3 5 7 9) # this will have more file numbers around 200
SECONDARY_PARTITION=(1 2 4 6 8) # this will have more file numbers around 200
dir1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
dir2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
echo $dir1
echo $dir2
if [ "$dir1" = "$dir2" ]
then
# delete all the files first
find "$PRIMARY" -mindepth 1 -delete
for el in "${PRIMARY_PARTITION[@]}
```
Solution
You posted another question on this same topic two weeks later, with a better solution using GNU parallel.
I'll review this one too anyway on its own merit,
though it might be a bit of a moot point.

It's not a good idea to set `StrictHostKeyChecking=no` when using `ssh`.
The host key of servers should normally not change.
When they do, and you don't know why,
it might be a man-in-the-middle attack.
If it's part of a scheduled server update,
you can manually update the `~/.ssh/known_hosts` file accordingly.

When filtering the output of `ls` like this:

```
ls -dt1 path | head -n1
```

you don't need the `-1` flag.
The purpose of that flag is to print the list of files in a single column.
But when you pipe the output to another command,
the output will always be a single column.
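
To make the point about `-1` concrete, here is a minimal sketch that compares the piped output of `ls` with and without the flag. The scratch directory and the three date-named folders are hypothetical stand-ins for the real snapshot folders, and their modification times are set by hand so that the `-t` ordering is predictable.

```shell
#!/bin/bash
# Hypothetical scratch directory standing in for the snapshot folder.
demo_dir=$(mktemp -d)
mkdir "$demo_dir"/20140315 "$demo_dir"/20140316 "$demo_dir"/20140317

# Give each folder a distinct modification time so that -t sorts predictably.
touch -t 201403150000 "$demo_dir"/20140315
touch -t 201403160000 "$demo_dir"/20140316
touch -t 201403170000 "$demo_dir"/20140317

# When ls writes to a pipe rather than a terminal, it emits one name per line
# whether or not -1 is given, so both counts come out the same.
with_flag=$(( $(ls -d1 "$demo_dir"/* | wc -l) ))
without_flag=$(( $(ls -d "$demo_dir"/* | wc -l) ))
echo "lines with -1: $with_flag, lines without -1: $without_flag"

# Same pipeline shape as in the script, minus the redundant -1.
newest=$(ls -dt "$demo_dir"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
echo "newest folder: $newest"

rm -rf "$demo_dir"
```

Both counts should match, which is why `ls -dt path | head -n1` is enough.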

Instead of this:

```
if [ "$dir1" = "$dir2" ]
```

This is easier (because you don't need to quote the variables) and more modern:

```
if [[ $dir1 = $dir2 ]]
```

One caveat: inside `[[ ... ]]` an unquoted right-hand operand is treated as a pattern, so quote it if it might contain glob characters.

Instead of cramming so many `ssh` options on the command line:

```
scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@machineA:...
```

it's better to add these options in the `~/.ssh/config` file, like this:

```
Host machineA
  Hostname machineA
  User david
  ControlMaster auto
  ControlPath ~/.ssh/control-%r@%h:%p
  ControlPersist 900
```

This way the `scp` command becomes simply:

```
scp machineA:...
```

I hope this (and even more, my other answer) helps!
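
As a small illustration of the `[[ ... ]]` suggestion, the sketch below runs the comparison with unquoted variables and then shows how quoting changes the meaning of the right-hand side. The two directory values reuse the example path from the question, and the `pattern` variable is a hypothetical value added only for the demonstration.

```shell
#!/bin/bash
# Example values taken from the question's sample path.
dir1="/data/pe_t1_snapshot/20140317"
dir2="/data/pe_t1_snapshot/20140317"

# Inside [[ ]] bash performs no word splitting or pathname expansion on
# variables, so leaving them unquoted is safe here.
if [[ $dir1 = $dir2 ]]; then
    same="yes"
else
    same="no"
fi
echo "same latest folder: $same"

# An unquoted right-hand side is matched as a pattern; a quoted one is
# compared literally. The pattern value is hypothetical.
pattern='2014*'
if [[ 20140317 = $pattern ]]; then glob_match="yes"; else glob_match="no"; fi
if [[ 20140317 = "$pattern" ]]; then literal_match="yes"; else literal_match="no"; fi
echo "unquoted right side matches: $glob_match, quoted right side matches: $literal_match"
```

For date-named folders like the ones in the question the difference does not matter, but it is worth keeping in mind when the right-hand side comes from less predictable input.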
Context
StackExchange Code Review Q#48898, answer score: 2