File parser convert to node data and neighbouring relational data
Problem
I have been working on a file parser that takes a very specific file format and parses it into a list that is arranged into node data and the neighbors that it relates to. I am new to Python (this is my very first program in this language), so I am not familiar with more advanced methods of solving the problem using Python.
The program runs very quickly: I get an average of about `Elapsed time: 0.0006923562137193841` with a test file, but I think it could be even better, especially if I task it with a significantly larger file.

Seeking from question:
- Optimization in the form of cleaner methods
- Decrease the overall runtime
- Verification of estimated runtime: \$O(N * E)\$. I got this because there are \$N\$ nodes, which each contain \$E\$ edges. However, this may or may not be incorrect.
- General style comments for Python coding
Input file example:
The following would be 1 line of data in the file. This file could contain thousands of lines, each line identifying a node and the neighbors that it has.
```
100 Alpha 123 321 ((101,Beta,123,321)(101,Gamma,123,321)(102,Alpha,123,321)(103,Alpha,123,321)(104,Beta,123,321)(105,Alpha,123,321)(099,Gamma,123,321)(098,Beta,123,321)(097,Beta,123,321)(222,Gamma,123,321)(223,Beta,123,321)(234,Gamma,123,321)(451,Beta,123,321)(999,Beta,123,321)(879,Gamma,123,321)(369,Gamma,123,321)(741,Beta,123,321)(753,Beta,123,321)(357,Beta,123,321)(159,Alpha,123,321))
```

The parsing would end with the line containing only "At the last row".

```
import os
import timeit
__author__ = 'Evan Bechtol'
"""
Parses through a file given the appropriate filepath. Once a filepath
has been received, the Parser instance opens and begins parsing out the
individual nodes and their neighbor node relationships. Each node is an
index of the nodeList, which contains a sub-list of the nodes that are neighbors
of that node.
The structure is created as follows:
nodeList: A list that is 'n' nodes long. The sub-list containing neighbors is of
"""
```
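The line format above can also be parsed in a free-standing way. The following is a minimal sketch, not the program under review; the field names `node_id`, `sector`, `freq`, `band` are assumptions based on the parameter names used in the parser's code.

```python
# Minimal sketch of parsing one line of the input format shown above.
# Field names (id, sector, freq, band) are an assumption based on the
# parser's own parameter names; this is not the original program.
def parse_line(line):
    head, _, neighbors = line.partition("((")
    node = tuple(head.split())  # e.g. ('100', 'Alpha', '123', '321')
    neighbor_list = [
        tuple(group.split(","))
        for group in neighbors.rstrip(")").split(")(")
    ]
    return node, neighbor_list

line = ("100 Alpha 123 321 ((101,Beta,123,321)(102,Alpha,123,321)"
        "(159,Alpha,123,321))")
node, neighbors = parse_line(line)
print(node)          # ('100', 'Alpha', '123', '321')
print(neighbors[0])  # ('101', 'Beta', '123', '321')
```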
Solution
Style
- Use `snake_case` for functions and variables.
- Leave one blank line between methods.
- Don't shadow built-in names such as `id` and `file`. Use synonyms or `id_`, with synonyms preferred.
- Keep lines to a maximum of 79 characters (comments to a maximum of 72, however).
- Avoid excessive whitespace.
- Use one space on both sides of the assignment operator, e.g. `a = 1`, not `a=1`.
As @Quill said, docstrings are good, and you should write them in place of some of the comments in your code.
Algorithms
`setNodeData` can be shortened to just:

```
def setNodeData(self, id_, sector, freq, band, neighborList):
    return ((id_), (sector), (freq), (band), (neighborList))
```

`searchForStartLine` should use the built-in `enumerate`:

```
def searchForStartLine(self, data):
    start_line = "-default- - - - - "
    for line_num, line in enumerate(data):
        if start_line in line:
            return line_num
```

`splitLine` is only used once. Also, you overwrite `line` and don't use it after the overwrite. Consider removing it.
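The `enumerate` pattern above can be tried standalone. A small demo follows; the data list here is made up for illustration, while the marker string is the one used in the answer's snippet.

```python
# Standalone demo of the enumerate-based search; the data list here is
# made up for illustration, the marker string is from the snippet above.
def search_for_start_line(data, start_line="-default- - - - - "):
    for line_num, line in enumerate(data):
        if start_line in line:
            return line_num
    return None  # marker not found

data = ["header line", "-default- - - - - ", "100 Alpha 123 321"]
print(search_for_start_line(data))  # 1
```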
`extractNeighbors` can be simplified to a list comprehension, and could be simpler that way. It is also faster than appending to an existing list.
I first used the `*` operator on the slice `[:4]`; this way you don't have to define `sector`, etc. I then removed the need for `i`: there is no need for it, as everything uses `neighbor[i]`.

```
def extractNeighbors(self, neighbors):
    return [
        self.setNeighborData(
            *(neighbor.replace(",", " ").split()[:4])
        )
        for neighbor in neighbors
    ]
```

We can do roughly the same thing above to `extractNode`, to minimise code. First remove all the noise of `sector` etc.; we will use `*(splitLine[:5])`. You only make changes to `splitLine[4]`, and the other information just clutters that.

```
def extractNode(self, splitLine):
    splitLine[4] = self.extractNeighbors(
        self.removeParens(splitLine[4]).split()
    )
    self.numNeighbors.append(len(splitLine[4]))
    self.nodeList.append(self.setNodeData(*(splitLine[:5])))
```
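The claim that a list comprehension beats repeated `append` can be checked with a rough `timeit` micro-benchmark; absolute numbers vary by machine, so only the relative ordering matters. This sketch uses standalone functions rather than the parser's methods.

```python
import timeit

# Rough micro-benchmark of append vs list comprehension on the same
# per-neighbor transformation; only the relative timing is meaningful.
def with_append(items):
    out = []
    for item in items:
        out.append(item.replace(",", " ").split()[:4])
    return out

def with_comprehension(items):
    return [item.replace(",", " ").split()[:4] for item in items]

items = ["101,Beta,123,321"] * 1000
print(timeit.timeit(lambda: with_append(items), number=100))
print(timeit.timeit(lambda: with_comprehension(items), number=100))
```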
In `readFile` you should change the for loop:

```
for line in data:
```

You don't use `line`, and you do `currentLine += 1`.

```
for line_num, _ in enumerate(data, currentLine):
    if lastLine in data[line_num + 1]:
        break
    else:
        nodeId = data[0]
        self.splitLine(data[line_num])
```

It may be better to use `range`, however. Overall, this will have no effect on the speed.
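A free-standing sketch of the `range`-based variant follows; the names here are hypothetical, and the look-ahead break on the line *after* the current one mirrors the loop above.

```python
# Hypothetical free-standing version of the range-based loop; it keeps
# the look-ahead break on the *next* line, mirroring the enumerate
# version above, so the line just before the terminator is not collected.
def collect_until(data, start, last_line):
    collected = []
    for line_num in range(start, len(data) - 1):
        if last_line in data[line_num + 1]:
            break
        collected.append(data[line_num])
    return collected

data = ["node 1", "node 2", "node 3", "At the last row"]
print(collect_until(data, 0, "At the last row"))  # ['node 1', 'node 2']
```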
But it highlights that `str.split` and `str.replace` are probably the reasons that it is slow, as they are, if I recall correctly, linear time.
You could try using the optional argument of `str.split`, `maxsplit`, to not go through the entire string. Then you wouldn't need to use `[:4]` and `[:5]`.
The only other way I can see of speeding this up would be to use a different algorithm to handle each line in roughly O(n).
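To illustrate the `maxsplit` suggestion (a sketch using the question's line format, not the answer's own rewrite):

```python
# With maxsplit, split stops after four splits: the first four fields
# come out individually and the neighbor blob stays whole in the last
# element, so no [:5] slice is needed and less of the string is scanned.
line = "100 Alpha 123 321 ((101,Beta,123,321)(102,Alpha,123,321))"
parts = line.split(maxsplit=4)
print(parts[:4])  # ['100', 'Alpha', '123', '321']
print(parts[4])   # ((101,Beta,123,321)(102,Alpha,123,321))
```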
I know that we aren't allowed to write complete rewrites, so here is an example I wrote.
Addressing the four points from the question:

1. There are no speed optimisations.
2. Same as 1, but try `str.split`'s `maxsplit`.
3. I can't comment.
4. It's throughout.
Context
StackExchange Code Review Q#96026, answer score: 7