HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Generating frequency tables based on CSV dataset

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
tablesfrequencycsvdatasetgeneratingbased

Problem

I am working on a project which crunches plain text files (.lst).

The name of the file names (fileName) are important because I'll extract node (e.g. abessijn) and component (e.g. WR-P-E-A) from them into a dataframe.

Examples:

abessijn.WR-P-E-A.lst
A-bom.WR-P-E-A.lst
acroniem.WR-P-E-C.lst
acroniem.WR-P-E-G.lst
adapter.WR-P-E-A.lst
adapter.WR-P-E-C.lst
adapter.WR-P-E-G.lst


Each file consists of one or more line. Each line consists of a sentence (inside ` tags). Example (abessijn.WR-P-E-A.lst)

Vooral mijn abessijn ruikt heerlijk kruidig .. : ) )
Mijn abessijn denkt daar heel anders over .. : ) ) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .


From each line I extract the sentence, do some small modifications to it, and call it
sentence. Up next is an element called leftContext, which takes the first part of the split between node (e.g. abessijn) and the sentence it came from. Finally, from leftContext I get precedingWord, which is the word preceding node in sentence, or the right most word in leftContext (with some limitations such as the option of a compound formed with a hyphen). Example:

ID | filename | node | component | precedingWord | leftContext | sentence
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 adapter.WR-P-P-F.lst adapter WR-P-P-F aanpassingseenheid Een aanpassingseenheid ( Een aanpassingseenheid ( adapter ) ,
2 adapter.WR-P-P-F.lst adapter WR-P-P-F toestel Het toestel ( Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens
3 adapter.WR-P-P-F.lst adapter WR-P-P-F de de aansluiting tussen de sensor en de de aansluiting tussen de sensor en de adapter ,
4 airbag.WS-U-E-A.lst

Solution

Speed comparisons are always a good thing, but it can be tricky to determine what is actually being compared. I think it's premature to decide that the comparison is "Python" vs. "R" without a lot of work to verify that all the libraries you use and functions you write are reasonably optimized for each language.

One of the strengths of python is a pretty good set of line profiling tools. I like line_profiler, because it works in IPython notebooks, a format I find convenient for "real time" coding and tooling around.

I used line profiling to run a (Python 2 version of) your code, and I found that one single line in your script was responsible for 94.5% of the execution time. Can you guess what it is? It was this one:

df = df.append([dict(fileName=fn, component=c, 
                                       precedingWord=pw, node=n, 
                                       leftContext=lc, sentence=s)])


I'm not a veteran of pandas but I think it's safe to say that building a data frame row-by-row in pandas is not very efficient.

How did I refactor your code to run in Python 2 and ran line_profiler on it? Essentially I just wrapped everything you wrote into a dummy function called run_code() and called the line profiler on that function. For Python 2:

  • I had to modify how you were parsing your directory (I use os.listdir()instead of grob stuff)



  • I had to import codec because Python 2 isn't natively able to Unicode the same way that Python 3 is.



The bottom-line results: your code took about 25.6 seconds (on my machine) to run, but simply filling the Pandas dataframe by column after parsing was done instead of row-by-row during parsing took 1.2 seconds. This simple modification led to speedup of more than 20×! You could probably get even faster by pre-allocating Numpy structured arrays for each column, and then using those to fill a dataframe by column.

In addition to the timing issue, there are a number of other aspects of your code that you may want to consider revising:

-
Variable names are confusing (what is s, etc.) and don't always follow PEP8. (Avoid camelCase and use snake_case instead, etc.)

-
For simple timing, consider the timeit module instead of datetime. (Of course, the main point of my answer is that line profiling is essential to figure out where the slow parts of your code are, be sure to use more than simple timing commands when you are optimizing...but there are certainly times where simple timing is useful, and timeit is a module engineered for that task.

-
You should factor your code into smaller functions that each accomplish a task. For example, one function could be generating a list of filenames to parse. Another function could parse the data and return a dataframe, and a third could find the frequencies.

In the end, it's tough to know what your original speed comparison means. If your original Python implementation of this script is based off of a direct translation from R, then it probably means that Pandas sucks at filling dataframes by row. But even if that is true, its unclear if Pandas is "slower" than R, because being aware of the row-by-row limitation, you should be able to easily work around it in almost every forseeable use case. (Can anyone think of an example where filling a dataframe row by row is essential and it can't be done any other way?)

Thanks for asking a fun question!

Here's all the code I used.

import os
import re
from datetime import datetime

import numpy as np
import pandas as pd

from glob import glob

# unicode file parsing support in Python 2.x
import codecs

# get unescape to work in Python 2.x
import HTMLParser
unescape = HTMLParser.HTMLParser().unescape

# line profiling
%load_ext line_profiler
import timeit


```
def run_code():

start_time = datetime.now()

# Create empty dataframe with correct column names
column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)

# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), subdir))

# "Cache" regex
# See http://stackoverflow.com/q/452104/1150683
p_filename = re.compile(r"[./\\]")

p_sentence = re.compile(r"(.*?)")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")

p_last_word = re.compile(r"^.\b(?<!-)(\w+(?:-\w+))[^\w]*$", re.U)

# Loop files in folder
filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]

for filename in filenames:
with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
[n, c] = p_filename.split(filename.lower())[-3:-1]
fn = ".".join([n, c])
for line in f:

Code Snippets

df = df.append([dict(fileName=fn, component=c, 
                                       precedingWord=pw, node=n, 
                                       leftContext=lc, sentence=s)])
import os
import re
from datetime import datetime

import numpy as np
import pandas as pd

from glob import glob

# unicode file parsing support in Python 2.x
import codecs

# get unescape to work in Python 2.x
import HTMLParser
unescape = HTMLParser.HTMLParser().unescape

# line profiling
%load_ext line_profiler
import timeit
def run_code():

    start_time = datetime.now()

    # Create empty dataframe with correct column names
    column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
    df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)

    # Create correct path where to fetch files
    subdir = "rawdata"
    path = os.path.abspath(os.path.join(os.getcwd(), subdir))

    # "Cache" regex
    # See http://stackoverflow.com/q/452104/1150683
    p_filename = re.compile(r"[./\\]")

    p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
    p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
    p_non_graph = re.compile(r"[^\x21-\x7E\s]")
    p_quote = re.compile(r"\"")
    p_ellipsis = re.compile(r"\.{3}(?=[^ ])")

    p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)

    # Loop files in folder
    filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]

    for filename in filenames:
        with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
            [n, c] = p_filename.split(filename.lower())[-3:-1]
            fn = ".".join([n, c])
            for line in f:
                uline = unicode(line)
                s = p_sentence.search(unescape(uline)).group(1)
                s = s.lower()
                s = p_typography.sub("", s)
                s = p_non_graph.sub("", s)
                s = p_quote.sub("'", s)
                s = p_ellipsis.sub("... ", s)

                if n in re.split(r"[ :?.,]", s):
                    lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]

                    pw = p_last_word.sub("\\1", lc)

                    df = df.append([dict(fileName=fn, component=c, 
                                       precedingWord=pw, node=n, 
                                       leftContext=lc, sentence=s)])
                continue

    # Reset indices
    df.reset_index(drop=True, inplace=True)

    # Export dataset
    df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")

    # Let's make a frequency list
    # Create new dataframe

    # Define neuter and non_neuter
    neuter = ["het"]
    non_neuter = ["de"]

    # Create crosstab
    df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"

    freqDf = pd.crosstab(df.node, df.gender)

    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

    # How long has the script been running?
    time_difference = datetime.now() - start_time
    print("Time difference of", time_difference)
    return
%lprun -f run_code run_code()
Timer unit: 1e-06 s

Total time: 25.6168 s
File: <ipython-input-5-b8823da4f6a5>
Function: run_code at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def run_code():
     2                                           
     3         1           10     10.0      0.0      start_time = datetime.now()
     4                                           
     5                                               # Create empty dataframe with correct column names
     6         1            2      2.0      0.0      column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
     7         1          384    384.0      0.0      df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
     8                                           
     9                                               # Create correct path where to fetch files
    10         1            2      2.0      0.0      subdir = "rawdata"
    11         1          119    119.0      0.0      path = os.path.abspath(os.path.join(os.getcwd(), subdir))
    12                                           
    13                                               # "Cache" regex
    14                                               # See http://stackoverflow.com/q/452104/1150683
    15         1          265    265.0      0.0      p_filename = re.compile(r"[./\\]")
    16                                           
    17         1          628    628.0      0.0      p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
    18         1          697    697.0      0.0      p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
    19         1          411    411.0      0.0      p_non_graph = re.compile(r"[^\x21-\x7E\s]")
    20         1          128    128.0      0.0      p_quote = re.compile(r"\"")
    21         1          339    339.0      0.0      p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
    22                                           
    23         1         1048   1048.0      0.0      p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
    24                                           
    25                                               # Loop files in folder
    26       108         1122     10.4      0.0      filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]
    27                                               
    28       108          250      2.3      0.0      for filename in filenames:
    29       107         5341     49.9      0.0          with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
    30       107          867      8.1      0.0              [n, c] = p_filename.split(filename.lower())[-3:-1]
    31       107          277      2.6      0.0              fn = ".".join([n, c])
    32      6607       395024     59.8      1.5              for line i

Context

StackExchange Code Review Q#101648, answer score: 7

Revisions (0)

No revisions yet.