
Shell script to count chess game outcomes


Problem

I came across this blog post by Adam Drake from around a year ago which is now making the rounds again.

I made some improvements to his code, but wish to see if there are additional tweaks that could be made to make it run even faster.

The task is to extract chess game results from PGN files. The files contain sequences of games, where each has a header which contains a "Result" line like this:

[Result "1-0"]
[Result "0-1"]
[Result "1/2-1/2"]


These three results indicate a white win, a black win, and a draw, respectively. The task is to simply collect and report a summary of these results.

Here is my solution to be reviewed:

find . -type f -name '*.pgn' -print0 |
 xargs -0 mawk -F '[-"]' '/Result/ { ++a[$2]; }
   END { print a["1"]+a["0"]+a["1/2"], a["1"], a["0"], a["1/2"] }'
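As a small sketch of why $2 ends up holding White's score: the field separator '[-"]' splits each Result line at every double quote and hyphen, so the text between the opening quote and the hyphen becomes the second field. The sample lines are the three from above; plain awk is used here only as an assumed stand-in for mawk, since both split fields the same way in this case.

```shell
# Assumption: `awk` stands in for mawk so the sketch runs where mawk is not installed.
# Feed the three possible Result lines through the same field separator
# and print the second field, which the real script uses as the array key.
split_demo=$(printf '%s\n' '[Result "1-0"]' '[Result "0-1"]' '[Result "1/2-1/2"]' |
  awk -F '[-"]' '/Result/ { print $2 }')

printf '%s\n' "$split_demo"
```

The three keys printed are 1, 0 and 1/2, which are exactly the subscripts summed in the END block.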


I was skeptical of using find over just listing the files in the reference data set, but my timings indicate that this is actually faster than a shell wildcard (Bash 4.3.11(1)-release).

tripleee@xvbvntv:ChessData$ time find . -type f -name '*.pgn' | wc -l
3025

real    0m0.014s
user    0m0.008s
sys     0m0.011s

tripleee@xvbvntv:ChessData$ time printf '%s\n' */*.pgn | wc -l
3025

real    0m0.037s
user    0m0.032s
sys     0m0.010s


The optimization I originally had in mind was to close the data file after reading the Result line, but as it turns out, the reference data set files contain multiple games, and thus multiple results (and the game portion is a lot smaller than I thought it would be).

tripleee@xvbvntv:ChessData$ time find . -type f -name '*.pgn' -print0 |
> xargs -0 mawk -F '[-"]' '/Result/ { ++a[$2]; }
>   END { print a["1"]+a["0"]+a["1/2"], a["1"], a["0"], a["1/2"] }'
6829065 2602614 1974505 2251946

real    0m50.232s
user    0m19.820s
sys     0m2.542s


This is as far as I got. (An earlier version, based on the blog post, attempted parallel processing, but removing that was the biggest performance improvement I made.) I don't think swit

Solution

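I don't see a reason for chaining with -print0 | xargs -0.
It's simpler to use -exec with a + terminator, which likewise passes many file names to each mawk invocation and copes with unusual file names, without the extra pipe: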

find . -type f -name '*.pgn' -exec mawk -F '[-"]' '...' {} +


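I don't see a way to make the AWK code faster, but I would tidy it up:

  • Some of the double-quoting is unnecessary: awk converts an integer subscript to its string form, so a["1"] and a["0"] can be written a[1] and a[0]
  • I would add spaces around operators for somewhat better readability
  • The semicolon after ++a[$2] can be dropped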

Like this:

/Result/ { ++a[$2] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }
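Putting the two suggestions together, the complete command might look like the sketch below. It runs against a tiny made-up data set in a scratch directory (the file names and contents are hypothetical, not the reference data), and it uses plain awk as an assumed stand-in for mawk so that it runs where mawk is not installed.

```shell
# Hypothetical sample data in a scratch directory (not the reference data set).
workdir=$(mktemp -d)
mkdir "$workdir/a" "$workdir/b"
printf '%s\n' '[Event "x"]' '[Result "1-0"]' '' '[Result "1/2-1/2"]' > "$workdir/a/one.pgn"
printf '%s\n' '[Result "0-1"]' '[Result "1-0"]' > "$workdir/b/two.pgn"

# Assumption: `awk` stands in for mawk; swap in mawk where it is available.
# find hands the matching files to awk in batches via -exec ... {} +
summary=$(find "$workdir" -type f -name '*.pgn' -exec awk -F '[-"]' '
  /Result/ { ++a[$2] }
  END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }' {} +)

# Fields: total games, white wins, black wins, draws
printf '%s\n' "$summary"
rm -rf "$workdir"
```

With this sample the summary line reads 4 2 1 1. On a very large file list, find may start more than one awk process, in which case each would print its own summary line.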

Context

StackExchange Code Review Q#78086, answer score: 3
