patternpythonMinor
Migrating HTML elements
Viewed 0 times
migratinghtmlelements
Problem
I have this code which migrates HTML elements to the word chunks by bracketing each element. Here, chunks is the list of word chunks to be processed and dom is to access the given HTML source. This returns a list of processed word chunks.
Could you suggest any improvements to this :
Could you suggest any improvements to this :
def _migrate_html(self, chunks, dom):
elements = self._get_elements_list(dom)
for element in elements:
result = []
index = 0
concat_chunks = []
for chunk in chunks:
if (index + len(chunk.word) <= element.index or
element.index + len(element.text) <= index):
result.append(chunk)
elif (index <= element.index and
element.index + len(element.text) <= index + len(chunk.word)):
result.append(Chunk(
chunk.word.replace(element.text, element.source),
HTML_POS, HTML_POS, True))
elif (index < element.index + len(element.text) and
element.index + len(element.text) <= index + len(chunk.word)):
concat_chunks.append(chunk)
new_word = u''.join([c_chunk.word for c_chunk in concat_chunks])
new_word = new_word.replace(element.text, element.source)
result.append(Chunk(new_word, HTML_POS, HTML_POS, True))
concat_chunks = []
else:
concat_chunks.append(chunk)
index += len(chunk.word)
chunks = result
return chunksSolution
resultpoints to the variable being the actual result being returned, but it's actually the current state of processing. Evennew_chunkswould be a better name.
- Some expressions like
index + len(chunk.word)orelement.index + len(element.text)repeat themselves, and would be better as named variables describing their meaning. This also makes advancing the index inside theformore readable.
index_after_chunk = index + len(chunk.word)
index_after_element = element.index + len(element.text)
- Advancing the index:
index = index_after_chunk
- You should be using Python's chained comparisons for the last if:
if index
EDIT:
On further examination it seems you're actually checking for intersection of index ranges. Let's say we have this:
from collections import namedtuple
class IndexRange(namedtuple('IndexRange', ['start', 'length'])):
@property
def end(self): return self.start + self.length
# more helper methods here
chunk_range = IndexRange(start=index, length=len(chunk.word))
element_range = IndexRange(start=element.index, length=len(element.text))You can now cleanly express the logic as (I'm inventing methods here but you get the point):
- if chunk_range.disjoint_from(element_range):
- first case, ranges are disjoint
- elif chunk_range.fully_contains(element_range):
- second case, element_range is fully contained in chunk_range
- elif chunk_range.contains_point(element_range.end):
- third case, element_range starts before chunk_range and ends inside_chunk_range
- else:
- final case, element_range starts inside chunk_range and ends after chunk_range OR element_range fully contains chunk_range
If I got the cases right, then
- You should at least add them as comments explaining what each branch means
- Possibly justify the final case in the comment
- Possibly reorder the cases to better express your meaning. You don't have to rely on elif`, it's not a big deal if you re-test something you've tested before as long as it's readable and encapsulated in a helper method or function.
Code Snippets
from collections import namedtuple
class IndexRange(namedtuple('IndexRange', ['start', 'length'])):
@property
def end(self): return self.start + self.length
# more helper methods here
chunk_range = IndexRange(start=index, length=len(chunk.word))
element_range = IndexRange(start=element.index, length=len(element.text))Context
StackExchange Code Review Q#141307, answer score: 3
Revisions (0)
No revisions yet.