Loading a Protein Data Bank file into a numpy matrix

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

Here is my code:

def read_Coordinates_Atoms2(fileName, only_CA = True):
    '''
    in : PDB file
    out : matrix with coordinates of atoms
    '''
    with open(fileName, 'r') as infile:
        for line in infile :
            if only_CA == True :
                if line.startswith('ATOM') and line[13:15] == 'CA': 
                    try:    # matrix fill-up
                        CoordAtoms = np.vstack([CoordAtoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]]) # np.append
                    except NameError:  # matrix declaration
                        CoordAtoms = np.array([[line[30:38],line[38:46], line[46:54]]], float) 
            else : 
                if line.startswith('ATOM'):
                    try:    # matrix fill-up
                        CoordAtoms = np.vstack([CoordAtoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]]) # np.append
                    except NameError:  # matrix declaration
                        CoordAtoms = np.array([[line[30:38],line[38:46], line[46:54]]], float)              
        return CoordAtoms

Is there a more efficient way to do this? I mean, a way where I don't have to write the same lines twice? I think the code should look more like this:

def foo(file, condition2 = True):
    if condition1 and condition2 :
        # do lots of instructions
    elif condition1 :
        # do the same lots of instructions (but different output)

Solution

Since both of your blocks are identical, you can merge them using boolean logic.

First, both cases test line.startswith('ATOM'), so put that check first.

Second, either you have only_CA being True and you need 'CA' at line[13:15] too, or you have only_CA being False. In other words, you keep the line if either only_CA is False or 'CA' is at line[13:15].

This lets you rewrite your for loop as:

for line in infile:
    if line.startswith('ATOM') and (not only_CA or line[13:15] == 'CA'):
        try:    # matrix fill-up
            CoordAtoms = np.vstack([CoordAtoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]]) # np.append
        except NameError:  # matrix declaration
            CoordAtoms = np.array([[line[30:38],line[38:46], line[46:54]]], float)
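
As a quick sanity check that the merged condition keeps exactly the same lines as the original nested branches, the two can be compared over every combination of inputs. This is only a sketch: the helper names `original_keep` and `merged_keep` are hypothetical, and the three booleans stand in for `line.startswith('ATOM')`, `line[13:15] == 'CA'` and `only_CA`.

```python
from itertools import product

def original_keep(is_atom, is_ca, only_CA):
    # Mirrors the nested if/else of the original function.
    if only_CA == True:
        return is_atom and is_ca
    else:
        return is_atom

def merged_keep(is_atom, is_ca, only_CA):
    # Single condition used in the rewritten loop.
    return is_atom and (not only_CA or is_ca)

# Compare both versions on all eight combinations of the three booleans.
all_equal = all(
    original_keep(*flags) == merged_keep(*flags)
    for flags in product([False, True], repeat=3)
)
print(all_equal)
```

If the two versions ever disagreed, `all_equal` would be False and the merge would not be safe.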

You can also extract the line parsing, since it is repeated in both branches of the try:

for line in infile:
    if line.startswith('ATOM') and (not only_CA or line[13:15] == 'CA'):
        data = [line[30:38], line[38:46], line[46:54]]
        try:    # matrix fill-up
            CoordAtoms = np.vstack([CoordAtoms, [float(x) for x in data]]) # np.append
        except NameError:  # matrix declaration
            CoordAtoms = np.array([data], float)

But you can also simplify the whole thing by converting your data to float before the try and feeding np.array data of the correct type:

for line in infile:
    if line.startswith('ATOM') and (not only_CA or line[13:15] == 'CA'):
        data = [float(line[begin:end]) for begin, end in ((30, 38), (38, 46), (46, 54))]
        try:    # matrix fill-up
            CoordAtoms = np.vstack([CoordAtoms, [data]]) # np.append
        except NameError:  # matrix declaration
            CoordAtoms = np.array([data])
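
Going one step further than the snippets above, the try/except NameError can probably be dropped altogether by collecting the rows in a plain list and building the array once at the end. The following is a minimal sketch under that assumption, not a drop-in replacement: the name `read_coordinates` is hypothetical, and unlike the original it returns an empty (0, 3) array instead of raising when no line matches.

```python
import numpy as np

def read_coordinates(file_name, only_CA=True):
    """Return an (N, 3) float array of ATOM coordinates from a PDB file."""
    rows = []
    with open(file_name) as infile:
        for line in infile:
            if line.startswith('ATOM') and (not only_CA or line[13:15] == 'CA'):
                # x, y and z sit in fixed-width columns of the ATOM record.
                rows.append([float(line[begin:end])
                             for begin, end in ((30, 38), (38, 46), (46, 54))])
    # Build the array once; the explicit reshape keeps a (0, 3) shape
    # when no line matched.
    return np.array(rows, dtype=float).reshape(len(rows), 3)
```

Appending to a list and converting once avoids copying the whole array on every np.vstack call, which matters for files with many atoms.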

Context

StackExchange Code Review Q#144429, answer score: 4
