HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Load recurring (but not strictly identical) sets of Key, Values into a DataFrame from text files

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fromstrictlyintobuttextidenticalrecurringfilesloadvalues

Problem

I am reading text files that contain data from observations. The format is not Fixed Width or Delimited, so I built a generator that gets Key, Values pairs and yields a dictionary when it has read a full observation record (~75 pairs in my case). The main loop builds a list from these dictionaries and then loads them into a DataFrame.

This code, although working, is slow and I realize it may be not optimum to build a long list of dictionaries to finally load into a DataFrame.

Side notes:

  • I am not a developer and new to Python



  • I choose to use Dictionaries because my data files may contain different number of observations (Key, Value pairs) for each Record, but generally, this number should be consistent within one data file



  • I do not control the text files format, I have to live with it...



  • The text files format is vaguely resembling JSON with each Record ([ ] enclosed) consisting of some (Key, Value) pairs and nested Sub-Records both indented by 2 spaces, (Key, Value) pairs within Sub-Records are indented by 4 spaces and [ ] enclosed.



  • Key and Values are separated by [space]:[space]



  • Values can span multiple line, in this case the line ends with 4 spaces and the following lines are indented by more than 4 spaces, but this can be broken if the Value contains a LF character.



Is there a more efficient way to load my Key, Value dictionaries into my DataFrame?

```
import glob
import pandas as pd

raw_ext = '.raw'
raw_path = input('RAW Files Folder path: ')

OBREP = '=' * 15
SBREP = '[\n'
KVSEP = ' : '
VCONT = ' \n'
VENDS = ' \n'

def get_rec_dict(file):
recs = {}
f = False
for line in file:
if OBREP in line:
f = True
if KVSEP in line and not line.endswith(SBREP):
vlist = line.split(KVSEP)
k = vlist.pop(0).strip()
if line.endswith(VCONT):
for line in file:
vlist.append(line.strip())

Solution

I think you should refactor things about. raw_ext should be RAW_EXT to clarify that it's a constant. Then raw_path should be down with raw_files, to make it clear that it's a user defined value. I might even make a function for get_raw_files(), like this:

def get_raw_files():
    raw_path = input('RAW Files Folder path: ')
    return glob.glob('{0}*{1}'.format(raw_path, raw_ext))


This makes it easier to test individual parts of your code, and update them too if you realise that you need to test the user's input, for example like this:

def get_raw_files():
    while True:
        raw_path = input('RAW Files Folder path: ')
        if os.path.isdir(raw_path):
            break
        print('Path not found, please check that it exists')

    return glob.glob('{0}*{1}'.format(raw_path, raw_ext))


get_rec_dict is a very long function, separating it into individual tasks would make it much more readable.

It's good that you've clearly marked some strings as constants, but the names are terribly unclear. They're clearly shortened words, but that makes me unable to gather what they're supposed to mean. Sure KVSEP is probably a separator, but of what? Try to make them more clear, even if it involves longer verbose lines.

In get_rec_dict, your f value seems pointless. Wouldn't it just be the same to end your loops with if OBREP in line? line is never modified in your loop, so it'll give the same result whether you test at the start or end of an iteration. If there is a realy difference between what you have and what I suggest, then it's hacky and unclear. And you should clarify it with a comment. You should also combine if tests, rather than nesting them:

if OBREP in line and recs:
    yield recs


You can use map to run a function on every value of a list, so instead of this:

v = '\n'.join(val.strip() for val in vlist)


you can do this:

v = '\n'.join(map(str.strip, vlist))


It's a little better performance wise, and easier to read.

I may be wrong, but I believe you could just call list(get_rec_dict(infile)) directly, rather than looping over the result and appending each value. Your loop doesn't allow breaking or catch errors, so I can't see any difference with it apart from inefficiency.

rec_list += list(get_rec_dict(infile))


It also seems silly to have file_no when all you care about is having more than one file in the raw_files list. Use Python's truthiness for this test instead. An empty list can be evaluated as False by Python, while a list that contains elements will be read as True.

if raw_files:
    print('{} RAW files loaded.'.format(file_no))
else:
    print('No file found.')

Code Snippets

def get_raw_files():
    raw_path = input('RAW Files Folder path: ')
    return glob.glob('{0}*{1}'.format(raw_path, raw_ext))
def get_raw_files():
    while True:
        raw_path = input('RAW Files Folder path: ')
        if os.path.isdir(raw_path):
            break
        print('Path not found, please check that it exists')

    return glob.glob('{0}*{1}'.format(raw_path, raw_ext))
if OBREP in line and recs:
    yield recs
v = '\n'.join(val.strip() for val in vlist)
v = '\n'.join(map(str.strip, vlist))

Context

StackExchange Code Review Q#116960, answer score: 2

Revisions (0)

No revisions yet.