Is there a 'better' way to find files from a list in a directory tree
Problem
I have created a list of files, foundlist.lst, using find. The find command is simply

find . -type f -name "<pattern>" > foundlist.lst

I would now like to use this list to find copies of these files in other directories.

The 'twist' in my requirements is that I want to search only for the 'base' of the file name. I don't want to include the extension in the search.

Example: ./sort.cc is a member of the list. I want to look for all files of the pattern sort.*

Here is what I wrote. It works. It seems to me that there is a more efficient way to do this.
./findfiles.sh foundfiles.lst /usr/bin/temp
#!/bin/bash
# findfiles.sh
if [ $# -ne 2 ]; then
    echo "Need two arguments"
    echo "usage: findfiles <file_list> <search_dir>"
else
    filename=$1
    echo "$filename"
    while read -r line; do
        name=$line
        # change './file.ext' to 'file.*'
        search_base=$( echo ${name} | sed "s%\.\/%%" | sed "s/\..*/\.\*/" )
        find $2 -type f -name $search_base
    done < $filename
fi

Solution
I think you can do three things to improve your script(s):
- Try to avoid complexity O(N^2)
- Try to reuse intermediate results
- Try to reduce extra process calls
The first aspect comes about because you compare ALL files in the first directory to ALL files in the second directory. If both directories scale more or less by N you will have a rather slow algorithm for really large directories.

The second aspect relates to calling find again and again instead of calling it once and then using the output of the call for comparison.

The third aspect relates to e.g. calling sed twice in a pipe for something that can be done by the bash itself.
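As a minimal sketch of that third point (using the example filename from the question): the './file.ext' to 'file.*' rewrite needs no sed processes at all, since bash parameter expansion can do both steps.

name="./sort.cc"
name=${name#./}              # strip the leading './'
search_base="${name%.*}.*"   # drop the extension and append '.*'
echo "$search_base"          # prints: sort.*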
I have prepared a slightly different approach which is far from elegant and/or perfect but which may give you some ideas. The key features are:
- It only calls find once for each directory and stores the result in files.
- Instead of a nested loop (with find being the inner loop) it uses sort with complexity O(N log N) to find the matches; a toy illustration of this merge idea follows the list.
- It focuses on the basenames of the filenames for comparison. Only for the result are the "long" names looked up again.
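For intuition, here is a toy run of that merge idea with made-up basenames (not taken from the answer): sorting tagged "name source" pairs makes a basename present in both directories show up on adjacent lines, tag 1 first, then tag 2.

printf '%s\n' "sort 1" "main 1" "sort 2" | sort -k 1,1 -k 2,2 -u
# main 1
# sort 1
# sort 2   <- "sort" tagged 2 right after the same name tagged 1: a match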
This is the script:
#!/bin/bash

dir1=$1
dir2=$2

# list all files in dir 1 using the format "filename filepath"
find "$dir1" -type f -printf "%f %p\n" > dir1.lst

# strip the extensions, sort the basenames, and append " 1" as the source tag
while read -r filename filepath; do
    echo "${filename%.*}"
done < dir1.lst | sort -u | while read -r filename; do
    echo "$filename 1"
done > dir1base.lst

# list all files in dir 2 using the format "filename filepath"
find "$dir2" -type f -printf "%f %p\n" > dir2.lst

# strip the extensions, sort the basenames, and append " 2" as the source tag
while read -r filename filepath; do
    echo "${filename%.*}"
done < dir2.lst | sort -u | while read -r filename; do
    echo "$filename 2"
done > dir2base.lst

# now merge the two lists together; sort them by filename first and by source second
last=""
sort -k 1 -k 2 -u dir1base.lst dir2base.lst | while read -r filename dir; do
    # an entry from source 2 directly preceded by the same filename
    # from source 1 is a match
    if [[ $dir -eq 2 && $last == "$filename" ]]; then
        echo "$filename"
    fi
    last=$filename
done | while read -r match; do
    # now grep for the matches in both original file lists
    echo "*** File(s)"
    grep "$match" dir1.lst
    echo "*** match file(s)"
    grep "$match" dir2.lst
    echo "---"
done

Note that the output loop still has complexity O(M^2) with the inner loop being in the grep call. However, in this case M scales with the number of matches and not with the original input size. The former should be considerably smaller.

Also note that the first find command will have to be adapted to be more selective than it is now.
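As a further idea (my suggestion, not part of the original answer): if the number of matches ever grows large, the per-match grep loop could be replaced by join(1), which merges two sorted inputs in linear time. The sketch below assumes a hypothetical matches.lst holding the matched basenames, one per line, and re-keys dir1.lst by extension-less basename first (it also assumes filenames without embedded spaces).

# Sketch only: re-key dir1.lst as "basename filepath", sorted on the key
while read -r filename filepath; do
    echo "${filename%.*} $filepath"
done < dir1.lst | sort -k 1,1 > dir1keyed.lst

sort matches.lst > matches.sorted    # matches.lst: one matched basename per line
join dir1keyed.lst matches.sorted    # prints "basename filepath" for each match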
Context
StackExchange Code Review Q#37511, answer score: 3