Is there a 'better' way to find files from a list in a directory tree
Problem
I have created a list of files, foundlist.lst, using find. The find command is simply

find . -type f -name "<pattern>" > foundlist.lst

I would now like to use this list to find copies of these files in other directories.

The 'twist' in my requirements is that I want to search only for the 'base' of the file name. I don't want to include the extension in the search.

Example: ./sort.cc is a member of the list. I want to look for all files of the pattern sort.*

Here is what I wrote. It works. It seems to me that there is a more efficient way to do this.
./findfiles.sh foundfiles.lst /usr/bin/temp
#!/bin/bash
# findfiles.sh
if [ $# -ne 2 ]; then
    echo "Need two arguments"
    echo "usage: findfiles <file_list> <search_dir>"
else
    filename=$1
    echo "$filename"
    while read -r line; do
        name=$line
        # change './file.ext' to 'file.*'
        search_base=$( echo ${name} | sed "s%\.\/%%" | sed "s/\..*/\.\*/" )
        find $2 -type f -name $search_base
    done < $filename
fi

Solution
I think you can do three things to improve your script(s):
- Try to avoid complexity O(N^2)
- Try to reuse intermediate results
- Try to reduce extra process calls
The first aspect comes about because you compare ALL files in the first directory to ALL files in the second directory. If both directories scale more or less by N you will have a rather slow algorithm for really large directories.

The second aspect relates to calling find again and again instead of calling it once and then using the output of the call for comparison.

The third aspect relates to e.g. calling sed twice in a pipe for something that can be done by the bash itself.
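As a minimal sketch of that third point (using the example filename from the question): the './file.ext' to 'file.*' rewrite needs no sed processes at all, since bash parameter expansion can do both steps.

name="./sort.cc"
name=${name#./}              # strip the leading './'
search_base="${name%.*}.*"   # drop the extension and append '.*'
echo "$search_base"          # prints: sort.*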
I have prepared a slightly different approach which is far from elegant and/or perfect but which may give you some ideas. The key features are:
- It only calls find once for each directory and stores the result in files.
- Instead of a nested loop (with find being the inner loop) it uses sort with complexity O(N log N) to find the matches; a toy illustration of this merge idea follows the list.
- It focuses on the basenames of the filenames for comparison. Only for the result are the "long" names looked up again.
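For intuition, here is a toy run of that merge idea with made-up basenames (not taken from the answer): sorting tagged "name source" pairs makes a basename present in both directories show up on adjacent lines, tag 1 first, then tag 2.

printf '%s\n' "sort 1" "main 1" "sort 2" | sort -k 1,1 -k 2,2 -u
# main 1
# sort 1
# sort 2   <- "sort" tagged 2 right after the same name tagged 1: a match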
This is the script:
#!/bin/bash

dir1=$1
dir2=$2

# list all files in dir 1 using the format "filename filepath"
find "$dir1" -type f -printf "%f %p\n" > dir1.lst

# strip the extensions, sort the basenames, and append " 1" as the source tag
while read -r filename filepath; do
    echo "${filename%.*}"
done < dir1.lst | sort -u | while read -r filename; do
    echo "$filename 1"
done > dir1base.lst

# list all files in dir 2 using the format "filename filepath"
find "$dir2" -type f -printf "%f %p\n" > dir2.lst

# strip the extensions, sort the basenames, and append " 2" as the source tag
while read -r filename filepath; do
    echo "${filename%.*}"
done < dir2.lst | sort -u | while read -r filename; do
    echo "$filename 2"
done > dir2base.lst

# now merge the two lists together; sort them by filename first and by source second
last=""
sort -k 1 -k 2 -u dir1base.lst dir2base.lst | while read -r filename dir; do
    # an entry from source 2 directly preceded by the same filename
    # from source 1 is a match
    if [[ $dir -eq 2 && $last == "$filename" ]]; then
        echo "$filename"
    fi
    last=$filename
done | while read -r match; do
    # now grep for the matches in both original file lists
    echo "*** File(s)"
    grep "$match" dir1.lst
    echo "*** match file(s)"
    grep "$match" dir2.lst
    echo "---"
done

Note that the output loop still has complexity O(M^2) with the inner loop being in the grep call. However, in this case M scales with the number of matches and not with the original input size. The former should be considerably smaller.

Also note that the first find command will have to be adapted to be more selective than it is now.
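As a further idea (my suggestion, not part of the original answer): if the number of matches ever grows large, the per-match grep loop could be replaced by join(1), which merges two sorted inputs in linear time. The sketch below assumes a hypothetical matches.lst holding the matched basenames, one per line, and re-keys dir1.lst by extension-less basename first (it also assumes filenames without embedded spaces).

# Sketch only: re-key dir1.lst as "basename filepath", sorted on the key
while read -r filename filepath; do
    echo "${filename%.*} $filepath"
done < dir1.lst | sort -k 1,1 > dir1keyed.lst

sort matches.lst > matches.sorted    # matches.lst: one matched basename per line
join dir1keyed.lst matches.sorted    # prints "basename filepath" for each match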
Context
StackExchange Code Review Q#37511, answer score: 3