HiveBrain v1.2.0
Get Started
← Back to all entries
snippetpythonMinor

File parser convert to node data and neighbouring relational data

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fileneighbouringparsernodeconvertrelationalanddata

Problem

I have been working on a file parser that takes a very specific file format and parses it into a list that is arranged into node data and the neighbors that it relates to. I am new to Python (this is my very first program in this language), so I am not familiar with more advanced methods of solving the problem using Python.

The program runs very quickly: I get an average of about Elapsed time: 0.0006923562137193841 with a test file, but think it could be even better, especially if I task it with a significantly larger file.

Seeking from question:

  • Optimization in the form of cleaner methods



  • Decrease the overall runtime



  • Verification of estimated runtime: \$O(N * E)\$. I got this because there are N nodes, which each contain E edges. However, this may or may not be incorrect.



  • General style comments for Python coding



Input file example:

The following would be 1 line of data in the file. This file could contain thousands of lines, each line identifying a node and the neighbors that it has.

100  Alpha  123  321  ((101,Beta,123,321)(101,Gamma,123,321)(102,Alpha,123,321)(103,Alpha,123,321)(104,Beta,123,321)(105,Alpha,123,321)(099,Gamma,123,321)(098,Beta,123,321)(097,Beta,123,321)(222,Gamma,123,321)(223,Beta,123,321)(234,Gamma,123,321)(451,Beta,123,321)(999,Beta,123,321)(879,Gamma,123,321)(369,Gamma,123,321)(741,Beta,123,321)(753,Beta,123,321)(357,Beta,123,321)(159,Alpha,123,321))


The parsing would end with the line containing only "At the last row".

```
import os
import timeit

__author__ = 'Evan Bechtol'

"""
Parses through a file given the appropriate filepath. Once a filepath
has been received, the Parser instance opens and begins parsing out the
individual nodes and their neighbor node relationships. Each node is an
index of the nodeList, which contains a sub-list of the nodes that are neighbors
of that node.

The structure is created as follows:
nodeList: A list that is 'n' nodes long. The sub-list containing neighbors is of

Solution

Style

  • Use snake_case for functions and variables.



  • Leave 1 line between methods.



  • Don't overwrite keywords, id, file. Use synonyms or id_, the former is preferred.



  • Keep lines to a maximum of 79. (Comments should be a maximum of 72 however.)



  • Avoid excessive white-space.



  • Use one space between both sides of the assignment operator. E.G. a = 1, not a = 1.



As @Quill said docstrings are good.
And you should write them instead of some of your comments in your code.
Algorithms

setNodeData Can be shortened to just:

def setNodeData(self, id_, sector, freq, band, neighborList):
    return ((id_), (sector), (freq), (band), (neighborList))


searchForStartLine should use the builtin enumerate.

def searchForStartLine(self, data):
    start_line = "-default-  -  -  -  -  "
    for line_num, line in enumerate(data):
        if start_line in line:
            return line_num


splitLine is only used once. Also you overwrite line and don't use it after the overwrite.
Consider removing it.

extractNeighbors can be simplified to a list comprehension, and could be simpler that way.
It is also faster than appending to an existing list.

I first used the * operator on the slice [:4]. This way you don't have to define sector, etc.
I then removed the need for i, there is no need for it, as everything uses neighbor[i].

def extractNeighbors(self, neighbors):
    return [
            self.setNeighborData(
                *(neighbor.replace(",", " ").split()[:4])
            )
            for neighbor in neighbors
            ]


We can do roughly the same thing above to extractNode, to minimise code.

First remove all the noise of sector etc. We will use *(splitLine[:5]).
You only make changes to splitLine[4], and the other informaton just clutters that.

def extractNode(self, splitLine):
    splitLine[4] = self.extractNeighbors(
        self.removeParens(splitLine[4]).split()
        )
    self.numNeighbors.append(len(splitLine[4]))
    self.nodeList.append(self.setNodeData(*(splitLine[:5])))


In readFile you should change the for loop.

for line in data:


You don't use line, and you do currentLine += 1.

for line_num, _ in enumerate(data, currentLine):
    if lastLine in data[line_num + 1]:
        break
    else:
        nodeId = data[0]
        self.splitLine(data[line_num])


It may be better to use range however.

Overall, this will have no effect on the speed.
But it highlights that str.split and str.replace are probably the reasons that it is slow.
This is as they are, if I recall correctly, linear time.
You could try using the optional argument of str.split, maxsplit, to not go through the entire string.
And you wouldn't need to use [:4] and [:5].

The only other way I can see speeding this up would be to use a different algorithm to handle each line in roughly O(n).
I know that we aren't allowed to write compleat rewrites so here is an example I wrote.

  • There are no speed optimisations.



  • Same as 1. But try str.split's maxsplit.



  • I can't comment.



  • It's throughout.

Code Snippets

def setNodeData(self, id_, sector, freq, band, neighborList):
    return ((id_), (sector), (freq), (band), (neighborList))
def searchForStartLine(self, data):
    start_line = "-default-  -  -  -  -  "
    for line_num, line in enumerate(data):
        if start_line in line:
            return line_num
def extractNeighbors(self, neighbors):
    return [
            self.setNeighborData(
                *(neighbor.replace(",", " ").split()[:4])
            )
            for neighbor in neighbors
            ]
def extractNode(self, splitLine):
    splitLine[4] = self.extractNeighbors(
        self.removeParens(splitLine[4]).split()
        )
    self.numNeighbors.append(len(splitLine[4]))
    self.nodeList.append(self.setNodeData(*(splitLine[:5])))
for line in data:

Context

StackExchange Code Review Q#96026, answer score: 7

Revisions (0)

No revisions yet.