patternpythonMinor

Migrating HTML elements

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

migratinghtmlelements

Problem

I have this code which migrates HTML elements to the word chunks by bracketing each element. Here, chunks is the list of word chunks to be processed and dom is to access the given HTML source. This returns a list of processed word chunks.

Could you suggest any improvements to this :

def _migrate_html(self, chunks, dom):
    elements = self._get_elements_list(dom)
    for element in elements:
      result = []
      index = 0
      concat_chunks = []
      for chunk in chunks:
        if (index + len(chunk.word) <= element.index or
            element.index + len(element.text) <= index):
          result.append(chunk)
        elif (index <= element.index and
              element.index + len(element.text) <= index + len(chunk.word)):
          result.append(Chunk(
              chunk.word.replace(element.text, element.source),
              HTML_POS, HTML_POS, True))
        elif (index < element.index + len(element.text) and
              element.index + len(element.text) <= index + len(chunk.word)):
          concat_chunks.append(chunk)
          new_word = u''.join([c_chunk.word for c_chunk in concat_chunks])
          new_word = new_word.replace(element.text, element.source)
          result.append(Chunk(new_word, HTML_POS, HTML_POS, True))
          concat_chunks = []
        else:
          concat_chunks.append(chunk)
        index += len(chunk.word)
      chunks = result
    return chunks

Solution

result points to the variable being the actual result being returned, but it's actually the current state of processing. Even new_chunks would be a better name.

Some expressions like index + len(chunk.word) or element.index + len(element.text) repeat themselves, and would be better as named variables describing their meaning. This also makes advancing the index inside the for more readable.

index_after_chunk = index + len(chunk.word)

index_after_element = element.index + len(element.text)

Advancing the index: index = index_after_chunk

You should be using Python's chained comparisons for the last if:

if index




EDIT:

On further examination it seems you're actually checking for intersection of index ranges. Let's say we have this:

from collections import namedtuple
class IndexRange(namedtuple('IndexRange', ['start', 'length'])):
  @property
  def end(self): return self.start + self.length

  # more helper methods here

chunk_range = IndexRange(start=index, length=len(chunk.word))
element_range = IndexRange(start=element.index, length=len(element.text))


You can now cleanly express the logic as (I'm inventing methods here but you get the point):

if chunk_range.disjoint_from(element_range):




first case, ranges are disjoint

elif chunk_range.fully_contains(element_range):




second case, element_range is fully contained in chunk_range

elif chunk_range.contains_point(element_range.end):




third case, element_range starts before chunk_range and ends inside_chunk_range

else:




final case, element_range starts inside chunk_range and ends after chunk_range OR element_range fully contains chunk_range



If I got the cases right, then

You should at least add them as comments explaining what each branch means



Possibly justify the final case in the comment

Possibly reorder the cases to better express your meaning. You don't have to rely on elif`, it's not a big deal if you re-test something you've tested before as long as it's readable and encapsulated in a helper method or function.

Code Snippets

from collections import namedtuple
class IndexRange(namedtuple('IndexRange', ['start', 'length'])):
  @property
  def end(self): return self.start + self.length

  # more helper methods here


chunk_range = IndexRange(start=index, length=len(chunk.word))
element_range = IndexRange(start=element.index, length=len(element.text))

Context

StackExchange Code Review Q#141307, answer score: 3

Revisions (0)

No revisions yet.