HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Configuring fields to be extracted by a data processor

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
extractedprocessorfieldsdataconfiguring

Problem

I solved this problem using closures with Perl and I wonder if/how I could/should have done it differently?

Background:

User creates config file describing properties of some source data file that would be processed:

  • Type of file delimited (and the delim) or fixed width



  • How many fields to expect



  • Which fields are to be pulled out



Examples on the last (if delimited):

Fields: FIRST_NAME 2
Fields: LAST_NAME 3
Fields: ADDRESS 4


or (fixed width):

Fields: FIRST_NAME 58,30
Fields: LAST_NAME  88,30
Fields: ADDRESS    118,50


My Perl reads the config, and prepares ahead the action to take once I start reading the source data file.

Internally, I'm creating preparing my master hash (per the config), which I'll loop on for each record when reading the source.

Now to handle the said examples, I have this code, if the source file would be delimited:

sub ret_col_sub
{
    my ($col_name, $col_pos) = @_

    warn qq($col_name col_pos='$col_pos' is not a number) if $col_pos =~ /\D/;

    $col_pos--;

    # Return closure to caller, which will be called later when we're reading the source data file
    return sub
    {
        my $rec = shift;

        return $rec->[$col_pos];
    };
}


or fixed width:

sub ret_substr_sub
{
    my ($col_name, $substr_params) = @_;

    my ($offset, $len) = $substr_params =~ /(\d+)\s*,\s*(\d+)/;

    # The offset might be ZERO, check for length ... but $len must always be P O S I T I V E !
    warn qq($col_name substr_params='$substr_params' is missing the offset and length needed by substr) if not length $offset or not $len;

    $offset--;
    $len--;

    # Return closure to caller, which will be called later when we're reading the source data file
    return sub
    {
        my $rec = shift;

        return substr $rec, $offset, $len;
    };
}


Is there a better or another way of doing this, or any other suggestions?

Solution

This seems like a desperate attempt to find a use for closures, and I believe your problem would be solved much more simply by making $rec an object, perhaps even a tied object that lets you overload indexing

Some observations:

-
It makes the API much less useful when the calling code has to know which form of data it is dealing with anyway. ret_col_sub returns a closure that expects a simple string for its parameter, while ret_substr_sub returns a closure that requires an array reference. All of that should be hidden from the caller

-
I can see that ret_col_sub won't compile

-
Checking that a string doesn't match \D isn't sufficient to ensure that it is numeric, as an empty string will pass that test

-
It is correct to rebase the substring offset, but a length is a length and that should be left untouched

-
The postfixed ++ and -- operators have become popular because of the name of the C++ language, but unless you are using the value returned by tge operator you should be using tge prefixed version. It is wrong to rely on the compiler to optimise out the storage of the unused original value

-
As has been said, your style contravenes common standards, in particular those laid out in perlstyle which are the most universal

Context

StackExchange Code Review Q#121581, answer score: 3

Revisions (0)

No revisions yet.