Determine how many unique numbers there are in a text file

Submitted by: @import:stackexchange-codereview

Problem

I have a program that has one job: to determine how many unique numbers there are in a text file. The text file contains a string of numbers created by this other program named Randomizer.py:

# Import random module
import random

# Variables for holding and transporting data
values = []
val_str = ""

def randomize():
    # Create original numbers
    ran_int = random.randint(500, 699)
    for num in range(0, ran_int):
        values.append(num)

    # Create duplicates
    ran_int_two = random.randint(250, 300)
    for duplicate in range(0, ran_int_two):
        val = random.choice(values)
        values.append(val)

    # Shuffle values... literally
    random.shuffle(values)

    list_transfer()

def list_transfer():
    global val_str
    # Place values in file
    text_file = open("text_input", "w")
    for index in range(0, len(values)):
        val_str += (str(values[index]) + " ")
    text_file.write(repr(val_str))
    text_file.close()


It creates a random number of unique values, then appends a random number of duplicates. Everything is written to a text file called text_input. Example output in the file (3,630 characters long, truncated here):

```
'206 428 186 321 648 612 171 418 447 565 355 430 250 567 314 526 360 151 171 595 452 397 195 416 54 604 485 335 460 35 633 479 555 39 557 574 320 610 20 127 64 556 552 230 56 234 336 207 548 634 244 507 247 332 350 564 241 586 169 94 277 58 332 42 607 65 419 558 27 144 216 311 229 57 179 152 341 333 156 448 597 580 66 641 453 360 376 178 385 595 296 197 391 505 110 563 275 597 362 560 266 396 531 510 260 518 57 190 100 293 553 501 623 256 579 226 63 334 70 647 294 492 307 592 528 488 436 591 464 626 633 541 211 585 264 398 283 577 366 254 491 220 261 626 469 91 636 290 550 598 315 476 90 533 583 227 14 627 305 494 466 262 362 183 472 451 468 65 393 450 603 609 362 362 322 356 118 575 478 20 335 206 323 579 21 261 109 380 89 508 533 516 484 44 473 168 335 31 604 261 509 547 464 462 80 93 383 402 103 565 48 218 49 155 135 143 547
```

Solution

Your first script can be greatly shortened. There is also no need for the global variable at all.

import random

def randomized_values():
    # Create original numbers
    values = list(range(random.randint(500, 699)))
    # Create duplicates
    for _ in range(random.randint(250, 300)):
        values.append(random.choice(values))
    random.shuffle(values)
    return values

def save(values, file_name):
    with open(file_name, "w") as text_file:
        text_file.write(" ".join(map(str, values)))

if __name__ == "__main__":
    values = randomized_values()
    save(values, "text_input")


Here I created the first list of values directly from range, using the fact that range starts at 0 by default. In Python 2, range already returns a list; in Python 3 it must be wrapped in list(...) before append will work.

If you want to have exactly random.randint(250, 300) duplicates (instead of mostly duplicates and some triplicates and...), you can use values.extend(random.sample(values, random.randint(250, 300))) instead of the for loop.
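A minimal sketch of that variant, written for Python 3 (an assumption; the code in this answer targets Python 2). The function name `randomized_values_exact` and the `rng` parameter are hypothetical, added only so the result can be seeded. Because `random.sample` draws without replacement, each chosen value gains exactly one extra copy.

```python
import random
from collections import Counter


def randomized_values_exact(rng=random):
    """Return shuffled values in which every duplicated number appears exactly twice."""
    # list() is required in Python 3, where range() has no append/extend
    values = list(range(rng.randint(500, 699)))
    # sample() draws without replacement, so no value is picked twice
    values.extend(rng.sample(values, rng.randint(250, 300)))
    rng.shuffle(values)
    return values


if __name__ == "__main__":
    values = randomized_values_exact(random.Random(0))
    counts = Counter(values)
    # No value should occur more than twice
    print(max(counts.values()))
```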

I got rid of all variables that were only used once. I used the single responsibility principle, so every function only does one thing and (if possible) returns its result.

I used with..as to ensure that the file will be closed even in case of an exception. I used str.join to build the output string and got rid of the repr, because the first thing the second script does is get rid of the surrounding ''. I made the script more adaptable by making the filename a parameter.
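To illustrate why dropping repr simplifies the reading side, here is a small round-trip sketch in Python 3 (an assumption). The `load` function is hypothetical and not part of the answer; it only shows that a plain space-separated file can be read back with split and no further cleaning.

```python
import os
import tempfile


def save(values, file_name):
    # The with block closes the file even if write() raises
    with open(file_name, "w") as text_file:
        text_file.write(" ".join(map(str, values)))


def load(file_name):
    # Hypothetical counterpart to save(): split on whitespace, convert back to int
    with open(file_name) as text_file:
        return [int(token) for token in text_file.read().split()]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "text_input")
    save([206, 428, 186], path)
    print(load(path))  # [206, 428, 186]
```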

Finally, I used if __name__ == "__main__": to allow importing parts of this module from another script.

Now to your actual script. There seems to be no need to actually save the file to disk; it would be better to just use values = randomizer.randomized_values(). But I'm going to go with the flow of your program and keep it for now. Note that I renamed the file to be lower_case, because that is how modules should be named in Python, according to its official style guide, PEP 8.

import time
import randomizer
from collections import Counter

FILE_NAME = "text_input"

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print "The function {.__name__} took {:.2f} seconds to finish.".format(func, time.time() - start)
        return result
    return wrapper

def get_uniques(occurences):
    return sum(1 for n in occurences.itervalues() if n == 1)

@timeit
def main():
    # Create new data values
    values = randomizer.randomized_values()

    # The next two steps are actually unnecessary
    randomizer.save(values, FILE_NAME) 
    with open(FILE_NAME) as text_file:
        occurences = Counter(text_file.read().split())

    # Would also work:
    # occurences = Counter(values)

    total = len(occurences)
    unique = get_uniques(occurences)
    duplicates = total - unique

    print "There were {} unique values and {} duplicates. There were {} total values.".format(unique, duplicates, total)

if __name__ == "__main__":
    main()


Note again that saving the values to a file and reading them back is completely unnecessary.

To get the number of occurrences of every number, I used collections.Counter, which consumes an iterable, counts how often each element occurs, and returns this information as a dictionary (Counter is a dict subclass).
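As a small illustration of what Counter produces, here is a sketch in Python 3 (an assumption) on a made-up input string. Note that the keys are strings, because they come straight from split.

```python
from collections import Counter

# Made-up sample data in the same space-separated format as the file
text = "206 428 171 171 335 335 335"
occurrences = Counter(text.split())

print(occurrences["335"])  # 3
print(occurrences["171"])  # 2
print(occurrences["999"])  # 0, missing keys count as zero instead of raising KeyError
print(len(occurrences))    # 4 distinct tokens
print(isinstance(occurrences, dict))  # True
```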

Note that all the data cleaning has now become obsolete, since save writes just the values (without the surrounding ''). I again used with..as to ensure that the file is correctly closed.

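In get_uniques I used dict.itervalues, which returns an iterator over the values of the dict and takes O(1) extra memory, in comparison to dict.values, which builds a list and takes O(n) memory in Python 2. (In Python 3, itervalues no longer exists and dict.values returns a lightweight view instead.)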
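For readers on Python 3, a possible port of the same function might look like the sketch below. The only change is calling values() instead of itervalues(); the parameter spelling is also normalized here.

```python
from collections import Counter


def get_uniques(occurrences):
    """Count the values that occur exactly once."""
    # In Python 3, dict.values() returns a view, so no list copy is made
    return sum(1 for n in occurrences.values() if n == 1)


if __name__ == "__main__":
    print(get_uniques(Counter([1, 1, 2, 3])))  # 2
```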

The result is now printed using str.format, even though this was not strictly necessary here.

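The count of unique items is computed with sum over a generator expression, so no intermediate list is built and the summing loop itself runs in C inside sum (the generator body is still evaluated by the interpreter). For the duplicates, I use the fact that every value that is not unique must have duplicates, so duplicates = total - unique.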

I used a decorator to measure the time it took the function to run. Note that it times the function, not the whole program per se. But since I used it on main, it is effectively the whole program (minus the initialization, imports and function definitions).
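A possible Python 3 rendering of the same decorator is sketched below. The use of functools.wraps and time.perf_counter is my own substitution, not part of the original code, and `count_distinct` is a hypothetical function used only to have something to time.

```python
import functools
import time


def timeit(func):
    @functools.wraps(func)  # keep func.__name__ and the docstring on the wrapper
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print("The function {} took {:.2f} seconds to finish.".format(func.__name__, elapsed))
        return result
    return wrapper


@timeit
def count_distinct(values):
    # Hypothetical workload, present only to demonstrate the decorator
    return len(set(values))


if __name__ == "__main__":
    print(count_distinct([1, 1, 2]))
```

Without functools.wraps, the decorated function would report its name as wrapper, which makes tracebacks and introspection harder to read.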

Edit:
I just realized that I might have misunderstood your requirements. If you are looking for the number of distinct numbers in values, use this:

values = randomizer.randomized_values()
unique = len(set(values))
total = len(values)
duplicates = total - unique


My original code would give 1 unique (as in, appears exactly once) value for values = [1, 1, 2]. This code will give 2 distinct values.
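The difference between the two interpretations can be checked directly on that example. This is a sketch in Python 3 (an assumption); the variable names are illustrative.

```python
from collections import Counter

values = [1, 1, 2]

# Interpretation 1: values that appear exactly once
appearing_once = sum(1 for n in Counter(values).values() if n == 1)

# Interpretation 2: distinct values, however often each appears
distinct = len(set(values))

print(appearing_once)          # 1 (only the value 2)
print(distinct)                # 2 (the values 1 and 2)
print(len(values) - distinct)  # 1 surplus copy
```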

Context

StackExchange Code Review Q#150412, answer score: 6
