HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Keeping just the rare characters in a long string

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
keepingthejustlongcharactersrarestring

Problem

I'm working on The Python Challenge. The current code iterates over a 98,766-character string before printing an 8-character string. Current execution time is between 7 and 10 seconds.

How can I optimize my current code and what rookie mistakes am I making that may be adding to the execution time?

import urllib.request
import re

url = 'http://www.pythonchallenge.com/pc/def/ocr.html'

x = urllib.request.urlopen(url).read().decode('utf-8')

text = (re.findall(r'', x, re.DOTALL))[1]

# source contains two adjacent html comments
# returns second comment

s = ''
for i in text:
    if text.count(i) < 10:
        s += i
print(s)

Solution

x is not a good variable name. I suggest page_source.

Your approach is inefficient because you re-count all of the occurrences of each character, which is O(n2), where n is the length of the string.

It would be better to use a Counter. It's more efficient because counts all characters using just one pass through the string. (We call that O(n).)

from collections import Counter

count = Counter(text)
s = ''.join(char for char in text if count[char] < 10)
print(s)


If you wanted to make your own counter with similar performance characteristics, you could do something similar to this:

count = {}
for char in text:
    count[char] = 1 + count.get(char, 0)

Code Snippets

from collections import Counter

count = Counter(text)
s = ''.join(char for char in text if count[char] < 10)
print(s)
count = {}
for char in text:
    count[char] = 1 + count.get(char, 0)

Context

StackExchange Code Review Q#94385, answer score: 4

Revisions (0)

No revisions yet.