Re-arranging an obfuscated address
Problem
I'm getting address (physical address, not digital) input that's obfuscated, and looks like the following:

The plaintext version:

'39 Jerrabomberra Ave. Narrabundah Canberra 2604 Australia'

The obfuscated version:

['39 Jerrabomberra Ave., Narrabundah', 'Canberra', ' ', '2604', ', ', 'Australia', '39 Jerrabomberra Ave., Narrabundah', 'Canberra 2604, ', 'Australia']

Usually the obfuscation is simple duplication and rearranging, which my script catches; however, there are a few edge cases that are missed, which I'm working on catching.

However, my solution feels like it could be simplified.

The logic follows this order:
- Join the array into one long string with a space as the 'glue' character.
- Use re.sub to find all commas and remove them.
- Split by space
- Add each non-empty component to the components array if it is not already in there.
- Join the components together.

import re
...
address = fooGetAddress(foo[bar]) #returns an array
address_components = []
for component in re.sub(",", "", " ".join(address)).split(" "):
    if component not in address_components and component is not "":
        address_components.append(component)
address = " ".join(address_components)

Solution
Not bad. However, we can do away with if component not in address_components and component is not "".

A better way to check if component not in address_components would be to use collections.OrderedDict:

"An OrderedDict is a dict that remembers the order that keys were first inserted. If a new entry overwrites an existing entry, the original insertion position is left unchanged."

That's exactly what we want. (Well, almost exactly. What we really want is an ordered set, but we can just use the keys of an OrderedDict and ignore the values.)

We can eliminate the need for component is not "" by using str.split() instead of str.split(" "):

"If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns []."

To get rid of the commas, you don't need a regular expression. str.replace() will do.

It's bad practice to use the same variable (address) for two different purposes, especially when the type changes (from a list of strings to a string).

With those changes, we can write the solution as just a single expression.

from collections import OrderedDict

obfuscated_address = …
address = ' '.join(
    OrderedDict(
        (component, None) for component in
        ' '.join(obfuscated_address).replace(',', '').split()
    ).keys()
)

Code Snippets
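
A minimal, self-contained sketch of the approach from the Solution, run against the sample input from the Problem. The function name deobfuscate_address and the inlined sample list are assumptions made for illustration and are not part of the original answer. The last lines show dict.fromkeys as an alternative, which assumes Python 3.7 or later, where plain dicts preserve insertion order.

```python
from collections import OrderedDict


def deobfuscate_address(parts):
    """Hypothetical helper: drop commas, split on whitespace, keep first occurrences."""
    tokens = ' '.join(parts).replace(',', '').split()
    # OrderedDict keys keep first-insertion order, so later duplicates are ignored.
    return ' '.join(OrderedDict((token, None) for token in tokens))


# Sample input copied from the Problem section.
sample = [
    '39 Jerrabomberra Ave., Narrabundah', 'Canberra', ' ', '2604', ', ',
    'Australia', '39 Jerrabomberra Ave., Narrabundah', 'Canberra 2604, ',
    'Australia',
]

result = deobfuscate_address(sample)
print(result)

# Alternative (assumes Python 3.7+, where plain dicts preserve insertion order).
tokens = ' '.join(sample).replace(',', '').split()
alternative = ' '.join(dict.fromkeys(tokens))
```

On the sample input, both variants should reproduce the plaintext address shown in the Problem. Because deduplication happens per whitespace-separated token, an address that legitimately repeats a token (for example, a street number equal to the postcode) would lose the second occurrence.
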

from collections import OrderedDict

obfuscated_address = …
address = ' '.join(
    OrderedDict(
        (component, None) for component in
        ' '.join(obfuscated_address).replace(',', '').split()
    ).keys()
)

Context
StackExchange Code Review Q#112521, answer score: 10