Find and report all occurrences of duplicate lines in a text file

Submitted by: @import:stackexchange-codereview

Problem

I would like you to take a look at this simple script I wrote, which is meant to emulate the Unix command sort [FILE] | uniq -cd. Unlike that command, my script also lists the line numbers at which each duplicate line occurs in the file. Please tell me what you think, and which parts of the script, if any, I should rework to make it better.

################################################################################
# File name: uniq.awk
# ===================
#
# Find and report all occurrences of duplicate lines in a text file.
#
#
# Usage: awk -f uniq.awk [FILE]
#
################################################################################
{
    x = lines[$0]["count"]++; # Count the number of occurrences of a line
    lines[$0]["NR"][x] = NR;  # Also save the line number of this occurrence

    # Find the length of the longest line to make it the column width
    if (x > 0) {
        if (length($0) > max) {
            max = length($0);
        }
    }
}
END {
    # If the file contains no lines to process, that is, it's empty,
    # return an exit status code of 1 to indicate the fact.
    if (!(NR > 0)) {
        exit 1;
    }

    # Prepare the format string
    # Column #1: number of occurrences of the line
    # Column #2: line itself
    # Column #3: line numbers where all the lines are located
    fmt_s = "%s: %-" max "s (%s)\n";  # "-" flag goes before the width: %-<max>s

    for (i in lines) {
        if (lines[i]["count"] > 1) {
            for (j = 0; j < lines[i]["count"]; j++) {
                s = s lines[i]["NR"][j] ", ";
            }
            # Get rid of the trailing comma and space
            s = substr(s, 1, length(s) - 2);
            printf(fmt_s, lines[i]["count"], i, s);
            s = "";
        }
    }
}


Test:

```
$ cat > data
car
baby
car
man
woman
woman
key
woman
$
$ cat -n data
1 car
2 baby
3 car
4 man
5 woman
6 woman
7 key
8 woman
$
$ awk -f uniq.awk data
2:
```

Solution

I don't know about Awk, but for this section of code:

END {
    # If the file contains no lines to process, that is, it's empty,
    # return an exit status code of 1 to indicate the fact.
    if (!(NR > 0)) {
        exit 1;
    }
    # ... (rest of the END block omitted)
}


You should print out an appropriate message to the terminal instead of just exiting with exit code 1. This, in my opinion, provides users with better feedback.
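
As a rough sketch of that suggestion, the empty-input check could report the problem before exiting. This assumes an awk that understands the special file name /dev/stderr (gawk does, and the script already relies on gawk for its nested arrays). The message wording and the shell function name check_not_empty are made up for illustration and are not part of the original script:

```shell
# Sketch only: check_not_empty and the message text are illustrative
# assumptions, not part of the original uniq.awk.
check_not_empty() {
    awk '
    END {
        # NR is still 0 in END when no input line was read.
        if (NR == 0) {
            # Write the diagnostic to standard error so it does not
            # mix with the regular report on standard output.
            print "uniq.awk: input is empty, nothing to report" > "/dev/stderr"
            exit 1
        }
    }' "$@"
}
```

Calling check_not_empty on an empty file should print the message and return status 1, while a non-empty file produces no output and status 0. In uniq.awk itself, only the print line would need to be added in front of the existing exit 1.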

Context

StackExchange Code Review Q#139689, answer score: 2