Counting unique visitors in access log

Problem

I'm aware I am probably reinventing the wheel somewhat here, but I am trying to teach myself simple bash coding by completing simple tasks such as parsing files.

To that end I am looking to learn what elephants in the room I could be missing or if there are better ways to use the core functionality of bash without installing any additional tools.

This simple code returns a list of unique IP addresses that have hit the index of my site along with a count of the hits.

a="access.log"; for b in $(cat $a | awk '{print $1}' | sort | uniq);do echo $b;grep $a -e "GET / HTTP" | grep -c $b;done;


Assumptions:
access.log is in the current directory and is in the regular format

Any advice or suggestions for improvement are greatly appreciated

Solution

Well, your code is hardly a bash solution, is it? You use cat, awk, sort, uniq, and grep; only the for loop and echo are bash itself.

Additionally, your code is crammed onto a single line, which makes it hard to read. Why not put it in a script, with separate commands on separate lines, like this:

#!/bin/bash

a="access.log"
for b in $(cat $a | awk '{print $1}' | sort | uniq); do
  echo $b;
  grep $a -e "GET / HTTP" | grep -c $b;
done;


Those variable names... ouch. a and b are hard to tell apart from the -c and -e options, and they carry no meaning of their own. Why not use meaningful names like ip and log?

Then, when I ran the code, I got a lot of funny results, like:

54.69.125.145
1
61.240.144.65
0
64.14.99.254
0
66.196.235.78
0
66.249.64.188
0
74.208.152.232
0


Why are there 0 counts? Because those are IPs that never requested the home page, only other pages. The loop iterates over every IP in the log, so they show up as $b, but none of their requests match "GET / HTTP".
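
Staying with the external tools for a moment, one way to avoid the zero counts is to filter for the home-page requests first and only then extract and count the IPs. The following is a minimal sketch, not a drop-in replacement: the function name count_home_hits is made up for illustration, and it assumes the client IP is the first space-separated field of each log line.

```shell
#!/bin/bash

# Hypothetical helper: list each IP that requested "/" with its hit count.
# Assumes the client IP is the first space-separated field on every line.
count_home_hits() {
  # Filter first, so IPs that never requested "/" are never listed.
  # uniq -c needs sorted input and prefixes each distinct IP with its count.
  grep -e 'GET / HTTP' "${1:-access.log}" | awk '{print $1}' | sort | uniq -c
}
```

Calling count_home_hits access.log prints one line per IP, with the count first and the IP second. It also reads the log only twice at most (once by grep, once through the pipe), instead of once per IP.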

I would consider making it more of a study of bash, and use the native bash constructs to get things right: no grep, awk, etc.

#!/bin/bash

# use first commandline argument if supplied
log="access.log"
if [ -n "$1" ] ; then
    log="$1"
fi

# set a variable to match in a regular expression
match="GET / HTTP"

# create an associative array (requires bash 4 or later)
declare -A counts

# read the file line-by-line
while IFS='' read -r line || [[ -n "$line" ]]; do

  # find lines that access GET / HTTP
  if [[ $line =~ $match ]] ; then

    # get just the IP of the client
    ip=${line%% *}
    # get the previous count, default to 0
    prev=${counts[$ip]:-0}
    # increment the count for this IP
    counts[$ip]=$((prev + 1))
  fi
done < "$log"

for ip in "${!counts[@]}" ; do
    echo "IP $ip visited ${counts[$ip]} times"
done


EDIT: About the ${line%% *} parameter expansion. The possibilities when expanding variables in bash are remarkably powerful. I recommend looking at the document Parameter Substitution for details, and the man page for bash is good as well (but does not have the examples). The %% operator removes the longest suffix of $line that matches the pattern following it. Here the pattern is a space followed by any characters (the * - this is a "glob" expression, not a regex), so the longest match starts at the first space. The expansion therefore removes the first space and all characters after it, leaving just the IP. The man page says:

${parameter%word}
${parameter%%word}

    The word is expanded to produce a pattern just as in filename expansion.
    If the pattern matches a trailing portion of the expanded value of
    parameter, then the result of the expansion is the value of parameter
    with the shortest matching pattern (the ‘%’ case) or the longest matching
    pattern (the ‘%%’ case) deleted. If parameter is ‘@’ or ‘*’, the pattern
    removal operation is applied to each positional parameter in turn, and the
    expansion is the resultant list. If parameter is an array variable
    subscripted with ‘@’ or ‘*’, the pattern removal operation is applied to
    each member of the array in turn, and the expansion is the resultant list.
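
To make the difference between % and %% concrete, here is a small sketch. The log line is a made-up sample in the common access-log layout, not taken from a real log.

```shell
#!/bin/bash

# Made-up sample line, for illustration only.
line='54.69.125.145 - - [10/Sep/2016:12:00:00 +0000] "GET / HTTP/1.1" 200 512'

# %% deletes the longest suffix matching ' *', i.e. everything from the
# first space onward, leaving only the first field.
first_field=${line%% *}

# % deletes the shortest suffix matching ' *', i.e. only the last space
# and the word after it.
without_last=${line% *}

echo "$first_field"    # prints 54.69.125.145
echo "$without_last"   # prints the line without its trailing " 512"
```

The same idea works from the other end with # and ##, which remove a matching prefix instead of a suffix.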

Context

StackExchange Code Review Q#141270, answer score: 5