HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Python algorithm mimicking mark/sweep garbage collection

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
sweepcollectionalgorithmmimickingpythongarbagemark

Problem

I don't have much experience with Python and would love some feedback on my implementation of this program.

I am given a file with an integer representing a number of memory-cells, followed by a list of ordered pairs representing links between either a pointer and a memory-cell, or memory-cell and a memory-cell. You could use the list to draw diagram akin to those used in text-book "which cells are marked and which are reclaimed" problems regarding mark/sweep GC.

I was tasked with creating a Python script that takes in such a list, and runs a simulated mark/sweep on it and my solution (below) is fully functional.

Here's an example input file:

6
ptr1,0
0,1
1,2
ptr2,3
3,2
4,5


And here is the script:

`import sys
import re

def RepresentsInt(s):
try:
int(s)
return True
except ValueError:
return False

#recursive function that traces through a mem-node structure
#setting each node's check_flag to 1 if traversed
def node_trace(pointer, node_pointer_list, nodes_dict):
if RepresentsInt(pointer[0]):
if nodes_dict[pointer[0]][1] == 1:
#returns if the current node has been traversed already
return
else:
#checks the current node as marked and traversed
nodes_dict[pointer[0]][0] = 1
nodes_dict[pointer[0]][1] = 1

for x in node_pointer_list:
#recurs for each node the current node points to
if x[0] == pointer[1] and nodes_dict[x[0]][1] == 0:
node_trace(x, node_pointer_list, nodes_dict)

#catches the node at the end of a chain
if RepresentsInt(x[0]):
if nodes_dict[x[0]][1] == 1:
nodes_dict[x[1]][0] = 1
nodes_dict[x[1]][1] = 1
else:
#catches a single-link chain from a variable to a mem-block
nodes_dict[x[1]][0] = 1
nodes_dict[x[1]][1] = 1

sysArgs = list(sys.argv)
inputFile = sysArgs[1]
inFile = open(inputFile)
node_pointer_list = []
nod

Solution

There are several suboptimal areas of this solution that are interrelated. Please feel free to ask questions if anything I've written here is unclear.

The solution is not "modeling" memory and then performing mark-and-sweep as it probably should be. Instead it is preserving the data-input file in node_pointer_list and repeatedly walking it to find what is needed, because of your for x in node_pointer_list: in node_trace(). This is causing the solution to unnecessarily iterate over the node_pointer_list inside each call to node_trace(), which is making its running time O(N^2) when it should be O(N). It is also making the code 2-3x longer than it should be. Further, the solution is quite hard to read.

I'll try to break down those issues one at a time, though keep in mind they are all interrelated. I'm intentionally explaining rather than providing code, so you can do the learning work to make a better solution. I'll start by talking about code readability.

The way you use if not RepresentsInt(item[0]): to determine if a line is a "pointer" line is very non-intuitive. It would be easier to read if the literal "ptr" appeared somewhere near where you are testing if it is a line containing "ptr", such as:

if item[0].find("ptr") == 0:


Why do you have both "marked" and "traversed" flags? They contain the same information.

A code line like this below, is very hard for anyone to read the meaning of. Someone reading the code is forced to dig around to find out the meaning of the zeroes and ones to understand it.

nodes_dict[x[1]][0] = 1


To make it more readable, you need to get the word "TRAVERSED" in there somewhere, since what you are doing is setting a traversed flag. If you stick with your two element arrays, you can make a TRAVERSED_FLAG symbol to make it more readable, as in:

nodes_dict[x[1]][TRAVERSED_FLAG] = 1


Better style would be to define a class, and give the class a .traversed field. If you also make some of the other changes I highlight below, this access becomes:

current_object.traversed = 1


I don't know if you are comfortable with classes or not, so I'll provide one explanation based on classes, and one based on arrays...

With Classes

You can put the marked field, and the list of an object's references in a single class that represents a memory position in your heap. The class definition would look something like:

class MemoryObject:
    def __init__(self):
        self.marked = 0
        self.references = []  # list of either  or 


First you would instantiate an array of these for the number of memory-spaces according to your data-file, with an additional one as the RootMemoryObject. Then you would walk the file's references, adding to each MemoryObject's references list. You can either store an int reference in the list, and look it up later, or you can look up the MemoryObject now and store that directly in references. When you see a "ptr" entry, you put that in the RootMemoryObject.

The first step of mark-and-sweep is to walk all memory locations and set .marked=0. Then you would call node_trace(RootMemoryObject), which would mark the object visited, and recurse through the objects referenced by that object's .references. Because you have the list of that object's references, you don't need the inefficient loop for x in node_pointer_list:.

Without Classes

Instead of node_pointer_list, you want an array of lists of references. As you walk the datafile, you would insert each reference into the appropriate list for that pointer number. Further, you want an additional "root_references" list to hold references which appear in "ptr" entries.

To start the recursion, you would do for reference in root_references: node_trace(reference), which would mark the reference visited (returning if it was already visited), then look up each child reference and recursively call node_trace(child_reference) on it. Because you have the list of that object's references, you don't need the inefficient loop for x in node_pointer_list:.

Other Factors

The running time you are shooting for is O(N), where N is the number of nodes in your memory. You are currently O(N^2). After you remove the extra loop in node_trace(), you will be properly O(N) because you already have a check to be sure you're not traversing a set of children more than one.

Using recursion is good. An alternative is to use a work-list. To use a work list, you put the root(s) in a list, and then while the work-list is non-empty, you remove an item from the work-list and call node_trace() on it. node_trace puts any child nodes that need to be visited into the work list, and exits.

Both will have the same O(N) running time, but can potentially have different memory consumption. The recursive version's maximum memory consumption is proportional to the depth of object references. The work-list version's maximum memory consumption is hard to calculate, as it can be

Code Snippets

if item[0].find("ptr") == 0:
nodes_dict[x[1]][0] = 1
nodes_dict[x[1]][TRAVERSED_FLAG] = 1
current_object.traversed = 1
class MemoryObject:
    def __init__(self):
        self.marked = 0
        self.references = []  # list of either <int> or <MemoryObject>

Context

StackExchange Code Review Q#161908, answer score: 4

Revisions (0)

No revisions yet.