snippetMinor
Quicksort in SAS for sorting datasets
Viewed 0 times
sortingdatasetsquicksortforsas
Problem
As a challenge for my own amusement, I decided to write a SAS macro for sorting datasets, deliberately avoiding all of the built-in methods (e.g.
Remarks:
Code:
```
option mprint source spool;
%macro quicksort(DSN,OUT,BY, DEBUG=N);
%let KEEP_SYNTAX_HIGHLIGHTING = %nrstr(%mend);
%let LIB = %upcase(%scan(&DSN,1,.));
%let DS = %upcase(%scan(&DSN,-1,.));
%let BY = %upcase(%trim(%sysfunc(compbl(&BY))));
%let BYVARS = %eval(%sysfunc(countc(&BY,%str( )))+1);
%put Sorting &LIB..&DS by &BY into output dataset &OUT;
%if &LIB = &DS %then %let LIB = WORK;
proc sql noprint;
select upcase(NAME), LENGTH, COUNT(*)
into :NNAMES separated by ' ',
:NLENGTHS separated by ' ',
:NCOUNT
from sashelp.vcolumn
where LIBNAME = "&LIB" and MEMNAME = "&DS" and TYPE = 'num'
;
select upcase(NAME), LENGTH, COUNT(*), max(LENGTH)
into :CNAMES separated by ' ',
:CLENGTHS separated by ' ',
:CCOUNT,
:MAXCLENGTH
from sashelp.vcolumn
where LIBNAME = "&LIB" and MEMNAME = "&DS" and TYPE = 'char'
;
select NAME
into :LISTNAMES separated by '--'
from sashelp.vcolumn
where LIBNAME = "&LI
proc sort / proc sql / ordered hash object).Remarks:
- I've kept as much of the logic as possible in data step code, but inevitably there's still quite a bit of macro logic.
linkis more or less equivalent togoto, andreturnis more or less equivalent tocomefromlast link- this was the most convenient way I could find of executing sections of code multiple times with the main data step.
- The code is not recursive - instead, I've used an array to store the ranges for queued up quicksort passes.
- Obviously performance is terrible compared with
proc sort- for any reasonably large dataset this is several orders of magnitude slower - but as far as I can tell the output is at least correct.
Code:
```
option mprint source spool;
%macro quicksort(DSN,OUT,BY, DEBUG=N);
%let KEEP_SYNTAX_HIGHLIGHTING = %nrstr(%mend);
%let LIB = %upcase(%scan(&DSN,1,.));
%let DS = %upcase(%scan(&DSN,-1,.));
%let BY = %upcase(%trim(%sysfunc(compbl(&BY))));
%let BYVARS = %eval(%sysfunc(countc(&BY,%str( )))+1);
%put Sorting &LIB..&DS by &BY into output dataset &OUT;
%if &LIB = &DS %then %let LIB = WORK;
proc sql noprint;
select upcase(NAME), LENGTH, COUNT(*)
into :NNAMES separated by ' ',
:NLENGTHS separated by ' ',
:NCOUNT
from sashelp.vcolumn
where LIBNAME = "&LIB" and MEMNAME = "&DS" and TYPE = 'num'
;
select upcase(NAME), LENGTH, COUNT(*), max(LENGTH)
into :CNAMES separated by ' ',
:CLENGTHS separated by ' ',
:CCOUNT,
:MAXCLENGTH
from sashelp.vcolumn
where LIBNAME = "&LIB" and MEMNAME = "&DS" and TYPE = 'char'
;
select NAME
into :LISTNAMES separated by '--'
from sashelp.vcolumn
where LIBNAME = "&LI
Solution
As far as overall, it looks like a good way to approach this. It's probably not how I would approach it, but given your choices at the start it looks good overall, and it does work. It's very slow, but it's not impossibly slow, and for most uses it's perfectly reasonable.
So, here are my initial critiques. I have peeked around your code some, but haven't done a proper thorough review yet; I'll add those over time.
-
Looking at the macro definition line itself:
I tend to prefer to require named parameters rather than permitting positional parameters, particularly with a complicated call that has a list in it. It's very easy to get this one wrong otherwise.
Also, if this will be used by others, it's good practice to include some information about the parameters.
Example:
Doing it this way will allow Enterprise Guide to show the notes inline in the syntax tips, and also allows readers to more easily see exactly what the macro parameters mean.
I would also recommend using
Finally, I would not require
-
Parameter checking
You should always check your parameters for presence/absence, if they're mandatory parameters (as
-
Debug flag usage
Much, much more of the output from this should be included as optional based on
-
PROC SQL notes
You should definitely always have a clean log in macros like this. That means not doing things like your
Also, your
-
Arrays
I would use temporary arrays here. You don't really need names for your array variables, do you? You're effectively using temporary arrays, so just make them that in actuality.
-
Do loop iterator names
Using
It's also very easy to have a variable name collision with those. It's not uncommon (for example) for someone to not drop
Instead, practice name safety. Use
-
Links
While I'm not an anti-goto extremist, I do think in a macro you can do better than this: you can just write these as their own macros. (Or as FCMP calls, if that's your preference, but since this is already written in the macro language, why not keep it there.) That would simplify your code significantly and make it easier to read the actual datastep, plus avoid link/return which are not that common nowadays.
So, here are my initial critiques. I have peeked around your code some, but haven't done a proper thorough review yet; I'll add those over time.
-
Looking at the macro definition line itself:
%macro quicksort(DSN,OUT,BY, DEBUG=N);I tend to prefer to require named parameters rather than permitting positional parameters, particularly with a complicated call that has a list in it. It's very easy to get this one wrong otherwise.
Also, if this will be used by others, it's good practice to include some information about the parameters.
Example:
%macro quicksort(dsn /*Input Dataset Name (including libname if needed) */,
out /*Output Dataset Name (including libname if needed) */,
By /*By Variables */,
Debug=N /* Debug flag, set to Y to get additional debugging information to the log */)Doing it this way will allow Enterprise Guide to show the notes inline in the syntax tips, and also allows readers to more easily see exactly what the macro parameters mean.
I would also recommend using
data instead of dsn, unless your site always uses dsn, for the input name: that's what people will expect it to be named, since that's what it is on most procs.Finally, I would not require
OUT. This can either be done by adding a default for it to IN, or by some code in the macro. This is, again, the default behavior in SAS, and as such the expectation of the user will likely be that it still behaves this way.-
Parameter checking
You should always check your parameters for presence/absence, if they're mandatory parameters (as
IN and BY are), and for being legal values for those parameters. Otherwise it can be very hard to know why something doesn't work properly. What if IN has a three level name for some reason? -
Debug flag usage
Much, much more of the output from this should be included as optional based on
DEBUG=Y. Right now you get a huge amount of output that's not needed. Almost everything you %put other than perhaps that first note that you're actually quicksorting should be in a &debug=Ygroup.-
PROC SQL notes
You should definitely always have a clean log in macros like this. That means not doing things like your
PROC SQLs that generate NOTE: The query requires remerging summary statistics back with the original data.. Split those queries such that you don't have to remerge - or do it a different way.Also, your
LISTNAMES query is not truly safe: while it will almost always work properly, you are actually counting on the min being before the max, which is technically not guaranteed by SQL. Select these independently, in two queries; or use a data step, where you are guaranteed order (or, you can at least enforce it with by).-
Arrays
I would use temporary arrays here. You don't really need names for your array variables, do you? You're effectively using temporary arrays, so just make them that in actuality.
-
Do loop iterator names
Using
i and j and such for do loops is common, but a bad idea. It's very easy to get into a mess - and hard to tell exactly what an iterator is doing.It's also very easy to have a variable name collision with those. It's not uncommon (for example) for someone to not drop
i after using it as an iterator - which could then cause undesired behavior (as you'll overwrite their i, which they might want). Instead, practice name safety. Use
__ at the beginning of all created variable names, for one; or even safer, __qs_ or similar. For iterator variables, use something that describes the loop or function of the iterator: do __nswap = 1 to ... for the numeric-swapping loop, do __cswap = 1 to ... for the character-swapping loop, etc. That makes it easier for you to know what an iterator variable is supposed to be doing when reading the code - and less likely you will have a name collision with your own variables.-
Links
While I'm not an anti-goto extremist, I do think in a macro you can do better than this: you can just write these as their own macros. (Or as FCMP calls, if that's your preference, but since this is already written in the macro language, why not keep it there.) That would simplify your code significantly and make it easier to read the actual datastep, plus avoid link/return which are not that common nowadays.
Code Snippets
%macro quicksort(DSN,OUT,BY, DEBUG=N);%macro quicksort(dsn /*Input Dataset Name (including libname if needed) */,
out /*Output Dataset Name (including libname if needed) */,
By /*By Variables */,
Debug=N /* Debug flag, set to Y to get additional debugging information to the log */)Context
StackExchange Code Review Q#79952, answer score: 4
Revisions (0)
No revisions yet.