HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Making line by line replacements using values from multiple arrays

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fromarrayslinereplacementsusingmultiplevaluesmaking

Problem

I currently have a Python script that parses a file (in this case an XML file) and makes line by line replacements as necessary, depending on the values of multiple arrays. As it stands now the script works great, but feel like it could be a lot better. I'll also need to add few more arrays later on and am worried about performance.

Additionally, the way I currently have it set up doesn't allow for counting and printing the total replacements that have been made. (e.g. "Replaced: xyz -- Made XX replacements") While not an immediate requirement, I'd like to add this functionality in the future.

arrayOne = ["old string one", "new string one"]
arrayTwo = ["old string two", "new string two"]

# Variable 'path' collected from command line input
f = open(path, "r", encoding="utf8")
newFile = open(path.replace(".xml", "-new.xml"), "w", encoding="utf8")

def replace(a,b):
    for data in f:
        for datatype in (arrayOne, arrayTwo):
            data = data.replace(datatype[a], datatype[b])
        newFile.write(data)
    newFile.close()

replace(0,1)

f.close()

Solution

One bug I see is that you perform the string substitutions sequentially, rather than "simultaneously". For example, suppose you had an input file with contents

abc


with substitution code

arrayOne = ["a", "b"]
arrayTwo = ["b", "c"]

…
replace(0, 1)


Most users would expect the output to be bcc. However, your code actually produces ccc, which would probably be considered surprising.

The main problem, though, is that everything is tightly coupled, such that the code is not reusable.

For example, replace() is a closure that captures arrayOne and arrayTwo, as well as file f. It doesn't generalize to perform one substitution or ≥ 3 substitutions. The two parameters that it does accept are more annoying than helpful, as you have indicated that you expect to always call it as replace(0, 1). Furthermore, replace() will only operate on the file that was opened as f at the time that the function was defined.

A better interface would be for replace() to accept a dict of substitutions to be performed. Furthermore, it should just concern itself with string transformation, staying clear of all file operations.

I propose the following solution, which contains some rather advanced concepts.

from functools import partial
import re

def str_replacer(mapping):
    regex = re.compile('|'.join(re.escape(orig) for orig in mapping))
    return partial(regex.sub, lambda match: mapping[match.group(0)])


You would use it like this:

>>> replace = str_replacer({
...     'a': 'b',
...     'b': 'c',
...     'c': 'd',
... })
>>> print(replace('abc'))
bcd


Explanation: Performing "simultaneous" replacements means scanning the input string until any of the "old" strings is encountered. If a substitution is performed, the scanning should recommence at the after where the newly spliced-in string ends.

That's tedious to do using low-level string manipulation, so I've opted to use re.sub() instead. To use re.sub(), I have to do two things

  • Compose a regular expression that includes all of the "old" strings to look for. That's what re.compile(…) does.



  • Create a callback function that specifies what the replacement for each old string should be. That's the lambda. match.group(0) is the old string that was found. We look its corresponding "new" string in the mapping.



However, I don't want str_replacer() to call regex.sub(repl, input_string) just yet. Instead, I want str_replacer() to return a function, that if called, performs the desired replacements. Why? Because you will likely call the resulting function many times if you want to process a file a line at a time, and it would be more efficient to reuse the regex and the lambda. That's what functools.partial() does: it creates a function that accepts an input string; when that function is called, then the regex.sub(lambda match: mapping[match.group(0)], input_string) actually happens.

replace = str_replacer({
    'old string one': 'new string one',
    'old string two': 'new string two',
})

with open(path, "r", encoding="utf8") as in_file, \
     open(path.replace(".xml", "-new.xml"), "w", encoding="utf8") as out_file:
    for line in in_file:
        out_file.write(replace(line))


Now the string replacement logic and the file manipulation logic are cleanly separated!

Code Snippets

arrayOne = ["a", "b"]
arrayTwo = ["b", "c"]

…
replace(0, 1)
from functools import partial
import re

def str_replacer(mapping):
    regex = re.compile('|'.join(re.escape(orig) for orig in mapping))
    return partial(regex.sub, lambda match: mapping[match.group(0)])
>>> replace = str_replacer({
...     'a': 'b',
...     'b': 'c',
...     'c': 'd',
... })
>>> print(replace('abc'))
bcd
replace = str_replacer({
    'old string one': 'new string one',
    'old string two': 'new string two',
})

with open(path, "r", encoding="utf8") as in_file, \
     open(path.replace(".xml", "-new.xml"), "w", encoding="utf8") as out_file:
    for line in in_file:
        out_file.write(replace(line))

Context

StackExchange Code Review Q#47365, answer score: 3

Revisions (0)

No revisions yet.