HiveBrain v1.2.0

Dataset and frequency list generation by looping over files

Submitted by: @import:stackexchange-codereview

Problem

I have been told that it would be wise to split up my code into semantically useful blocks. I tried the following but I need feedback.

What I already did:

  • As you can see, I create two .csv files, so it seemed logical to have two functions: one for the main dataset, the other for a frequency table.

  • I only import modules where I need them, though I did need some globally.
I probably should give fn, c and s better variable names. However, I'm not sure how to name them because they are identical to the column names they will be assigned to (e.g. node = node, sentence = sentence).

So what else can I do to better format my code into useful functions? Do note that I want to put an emphasis on performance, so any edits that decrease performance are discouraged.

```
import os, pandas as pd, numpy as np

from datetime import datetime

start_time = datetime.now()

# Create empty dataframe with correct column names
column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence"]
df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)

# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir))

def main_dataset():
    import regex as re
    from html import unescape
    # Loop files in folder
    filenames = [name for name in os.listdir(path) if re.match(".*?[.]lst", name)]

    # "Cache" regex
    # See http://stackoverflow.com/q/452104/1150683
    p_filename = re.compile(r"[./\\]")

    p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
    p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
    p_non_graph = re.compile(r"[^\x21-\x7E\s]")
    p_quote = re.compile(r"\"")
    p_ellipsis = re.compile(r"\.{3}(?=[^ ])")

    p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)

    fn_list = []
    c_list = []
    pw_list = []
    n_list = []
    lc_list = []
    s_list = []

    for filename in filenames:
        with open(path + '/' + file
```

Solution

Styling

Some minor edits, but you should import each module on a new line, especially if you're assigning them different names.

import os
import numpy as np
import pandas as pd


It seems redundant here to assign subdir for just a single use. If there are plans to use it later, sure, but as it stands this should be fine.

path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, "rawdata"))


I know you said that you wanted to save on performance, but since you call this function immediately anyway, these imports will always run, so they should be at the top with the rest.

import regex as re
from html import unescape
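
For context, a function-local import is only marginally cheaper to skip anyway: Python caches modules in sys.modules after the first import, so subsequent imports of the same module are just a dictionary lookup. A small sketch of that behaviour (json stands in as an arbitrary module):

```python
import sys

def uses_local_import():
    # The import statement runs on every call, but the module body
    # executes only once; later calls are a cheap sys.modules lookup.
    import json
    return json.dumps({"ok": True})

print(uses_local_import())  # {"ok": true}
```

Moving such imports to the top of the file therefore costs essentially nothing and keeps the dependencies visible in one place.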


The escape in the p_quote pattern is unnecessary when you can just use '"'. Python sees no difference between strings delimited by single or double quotes, so you can write the double-quote character directly.

p_quote = re.compile('"')
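
The equivalence is easy to check; the two literals produce the same one-character string, and the compiled pattern behaves identically (the replacement example below is made up):

```python
import re

# Single- and double-quoted literals yield the same string value,
# so '"' avoids the backslash escape needed inside a double-quoted literal.
assert '"' == "\""

p_quote = re.compile('"')
# Replace straight double quotes with single quotes in a sample string.
print(p_quote.sub("'", 'He said "hi"'))  # He said 'hi'
```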


Don't use path + '/' + filename to build a file path; use the os module instead. os.path.join(path, filename) inserts the separator appropriate to the OS Python is running on, so it works cross-platform.

Also there's no real point to opening the file in r+ mode, as that allows you to read and write but all you ever do is read the file anyway.
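
Both points together might look like this (the directory and file name below are hypothetical placeholders):

```python
import os

path = os.path.join("rawdata")   # hypothetical base directory
filename = "corpus.lst"          # hypothetical file name

# os.path.join uses the separator of the running OS, so this works
# unchanged on Windows and POSIX systems.
full = os.path.join(path, filename)

# Open read-only; "r" is the default mode and could even be omitted.
# with open(full, "r") as f:
#     text = f.read()
```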

As for your naming conundrum, I'd just use the actual column names. They won't clash, because you address the columns with strings, so there's a clear delineation between what is a list and what is a string; and what can be clearer than using the same name? Generally you can also leave list out of a variable name, unless you actually need to distinguish it from other variables.

The exception to both, in your case, is that you already have a separate filenames list, so I've given that column's list a different name; you might find a more suitable one.

filenameDataList.append(filenameData)
components.append(component)
precedingWords.append(precedingWord)
nodes.append(node)
leftContexts.append(leftContext)
sentences.append(sentence)

...

df['sentence'] = sentences
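
Put together, the column assignment would look something like this (a sketch with made-up data; in the real script the lists are filled inside the file loop):

```python
import pandas as pd

# Hypothetical data standing in for what the file loop would collect.
nodes = ["huis", "boom"]
sentences = ["Het huis is groot.", "De boom is oud."]

df = pd.DataFrame()
# The list name and the column name can safely coincide: the column is
# addressed by a string, the list by a bare identifier.
df["node"] = nodes
df["sentence"] = sentences
print(df.shape)  # (2, 2)
```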


Minor Performance

These two notes won't significantly reduce your per-file time, because they don't affect the regex work, but they will save a fraction of a second per file, which adds up when you run over many files.

You don't need to put n and c in a list on the left-hand side; plain comma separation performs the same unpacking, and it reads more idiomatically.

n, c = p_filename.split(filename.lower())[-3:-1]
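
For instance, with a hypothetical file name (the real ones come from os.listdir; the stdlib re module stands in for regex here, which is fine for this pattern):

```python
import re

p_filename = re.compile(r"[./\\]")

# Hypothetical file name standing in for a real .lst file.
filename = "Netherlands.Emails.lst"

# Plain tuple unpacking; no list literal needed on the left-hand side.
n, c = p_filename.split(filename.lower())[-3:-1]
print(n, c)  # netherlands emails
```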


There's also no reason to use join here. For just two items it's slower than plain string concatenation, and you're creating a list purely for the sake of the join.

fn = n + "." + c


Refactoring

I would put all the imports together, as I said above. I would also move the regex compilations near the top: they're effectively constants, so it's best to keep them near the top of the file rather than in a function where they'd be recreated identically on each definition (not that this affects your performance). That would mean having all of this before your data frame is created:

import numpy as np
import os
import pandas as pd
import regex as re

from html import unescape
from datetime import datetime

start_time = datetime.now()

p_filename = re.compile(r"[./\\]")
p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile('"')
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")


As for splitting up functions, I don't personally think that's necessary as your functions aren't overly long and it's a relatively specific process without a lot of repeating code, so I would keep to the two separate ones you have here.


Context

StackExchange Code Review Q#101838, answer score: 2
