patternMinor
Improving clarity and style of this "Document at a time" algorithm
Viewed 0 times
thisclaritytimealgorithmstyledocumentandimproving
Problem
I've created a module that implements a basic "Document at a time" algorithm for inverted index querying (more background here). There are many algorithm improvements possible, but I'm now interested in feedback on the structure of the Erlang code, specifically:
```
-module(search_daat).
-export([get_top/2]).
-record(
heap,
{
heap :: list({integer(), integer()}),
min_score :: integer()
}
).
-record(
postings,
{
docs :: list({integer(),integer()}),
max_score :: integer(),
name :: string()
}
).
-spec get_top(
integer(),
list(#postings{})
) -> #heap{}.
get_top(N, Postings) ->
get_top(N, Postings, #heap{ heap = [], min_score = 0}, 0, 0).
-spec get_top(
integer(),
list(#postings{}),
#heap{},
integer(),
integer()
) -> #heap{}.
get_top(N, Postings, Heap, CurrentDocId, UpperBound) ->
Result = next(Postings, CurrentDocId, UpperBound),
case Result of
{error, not_found} ->
Heap;
{{NextDocId, _Score}, NewPostings} ->
FullScore = full_score(NextDocId, NewPostings),
NewHeap = heap_insert(Heap, N, NextDocId, FullScore),
get_top(N, NewPostings, NewHeap, NextDocId, NewHeap#heap.min_score)
end.
-spec heap_insert(#heap{}, integer(), integer(), integer()) -> #heap{}.
heap_insert(Heap, MaxHeapSize, DocId, FullScore) ->
case length(Heap#heap.heap)
Heap1 = Heap#heap.heap ++ [{DocId, FullScore}],
Heap2 = [{_,MinScore} | _Rest] = lists:keysort(2, Heap1),
Heap#heap{ heap = Heap2, min_score = MinScore};
false ->
case Heap#heap.min_score > FullScore of
true ->
Heap;
false ->
Heap1 = lists:keydelete(
- Unneeded verbosity / code
- Erlang 'style' mistakes
- Things that can be done more elegantly
- (and yes, it's missing documentation)
```
-module(search_daat).
-export([get_top/2]).
-record(
heap,
{
heap :: list({integer(), integer()}),
min_score :: integer()
}
).
-record(
postings,
{
docs :: list({integer(),integer()}),
max_score :: integer(),
name :: string()
}
).
-spec get_top(
integer(),
list(#postings{})
) -> #heap{}.
get_top(N, Postings) ->
get_top(N, Postings, #heap{ heap = [], min_score = 0}, 0, 0).
-spec get_top(
integer(),
list(#postings{}),
#heap{},
integer(),
integer()
) -> #heap{}.
get_top(N, Postings, Heap, CurrentDocId, UpperBound) ->
Result = next(Postings, CurrentDocId, UpperBound),
case Result of
{error, not_found} ->
Heap;
{{NextDocId, _Score}, NewPostings} ->
FullScore = full_score(NextDocId, NewPostings),
NewHeap = heap_insert(Heap, N, NextDocId, FullScore),
get_top(N, NewPostings, NewHeap, NextDocId, NewHeap#heap.min_score)
end.
-spec heap_insert(#heap{}, integer(), integer(), integer()) -> #heap{}.
heap_insert(Heap, MaxHeapSize, DocId, FullScore) ->
case length(Heap#heap.heap)
Heap1 = Heap#heap.heap ++ [{DocId, FullScore}],
Heap2 = [{_,MinScore} | _Rest] = lists:keysort(2, Heap1),
Heap#heap{ heap = Heap2, min_score = MinScore};
false ->
case Heap#heap.min_score > FullScore of
true ->
Heap;
false ->
Heap1 = lists:keydelete(
Solution
I'm not sure writing explicit type specifications is idiomatic Erlang. In all programming languages that support inference, type specifications are used in specific cases, eg. when you want to restrict the possibles types, but not by default. Even if the doc on specifications explain a few advantages, the Erlang programming rules don't even mention it. Make sure you know why you're using them!
Please document your records: types help but I also want to know that the first integer is the document id, and that the second integer is the score, and I want to know it in the record definition. :) Also, watch out your indentation for the last brace.
Names:
In
If you changed
-record(
postings,
{
docs :: list({integer(),integer()}),
max_score :: integer(),
name :: string()
}
).Please document your records: types help but I also want to know that the first integer is the document id, and that the second integer is the score, and I want to know it in the record definition. :) Also, watch out your indentation for the last brace.
Names:
heap describes the container, but not what it contains: be more specific. The same applies for next, I think you could name it nextposting or next_posting, but I'm not sure. Names are hard!In
full_score, I wouldn't alter the logic of your foldl call just to remove the score of DocId. I would pass the whole tuple to full_score, and remove the score after the sum has been made. More generally, it makes sense to have function interfaces that get one or multiple 'document' rather than just document ids just because you need that right now. Defining a specific document type, for example, would help too.If you changed
full_score to get a full document including its score, you can call lists:map to retrieve scores and lists:sum to get the sum. That would probably be easier to read but would require two passes: that's up to you!Code Snippets
-record(
postings,
{
docs :: list({integer(),integer()}),
max_score :: integer(),
name :: string()
}
).Context
StackExchange Code Review Q#35679, answer score: 4
Revisions (0)
No revisions yet.