patternpythonMinor
Generating frequency tables based on CSV dataset
Viewed 0 times
tablesfrequencycsvdatasetgeneratingbased
Problem
I am working on a project which crunches plain text files (
The name of the file names (
Examples:
Each file consists of one or more line. Each line consists of a sentence (inside `
Mijn abessijn denkt daar heel anders over .. : ) ) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 adapter.WR-P-P-F.lst adapter WR-P-P-F aanpassingseenheid Een aanpassingseenheid ( Een aanpassingseenheid ( adapter ) ,
2 adapter.WR-P-P-F.lst adapter WR-P-P-F toestel Het toestel ( Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens
3 adapter.WR-P-P-F.lst adapter WR-P-P-F de de aansluiting tussen de sensor en de de aansluiting tussen de sensor en de adapter ,
4 airbag.WS-U-E-A.lst
.lst).The name of the file names (
fileName) are important because I'll extract node (e.g. abessijn) and component (e.g. WR-P-E-A) from them into a dataframe.Examples:
abessijn.WR-P-E-A.lst
A-bom.WR-P-E-A.lst
acroniem.WR-P-E-C.lst
acroniem.WR-P-E-G.lst
adapter.WR-P-E-A.lst
adapter.WR-P-E-C.lst
adapter.WR-P-E-G.lst
Each file consists of one or more line. Each line consists of a sentence (inside `
tags). Example (abessijn.WR-P-E-A.lst)
Vooral mijn abessijn ruikt heerlijk kruidig .. : ) )Mijn abessijn denkt daar heel anders over .. : ) ) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .
From each line I extract the sentence, do some small modifications to it, and call it sentence. Up next is an element called leftContext, which takes the first part of the split between node (e.g. abessijn) and the sentence it came from. Finally, from leftContext I get precedingWord, which is the word preceding node in sentence, or the right most word in leftContext (with some limitations such as the option of a compound formed with a hyphen). Example:
ID | filename | node | component | precedingWord | leftContext | sentence---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 adapter.WR-P-P-F.lst adapter WR-P-P-F aanpassingseenheid Een aanpassingseenheid ( Een aanpassingseenheid ( adapter ) ,
2 adapter.WR-P-P-F.lst adapter WR-P-P-F toestel Het toestel ( Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens
3 adapter.WR-P-P-F.lst adapter WR-P-P-F de de aansluiting tussen de sensor en de de aansluiting tussen de sensor en de adapter ,
4 airbag.WS-U-E-A.lst
Solution
Speed comparisons are always a good thing, but it can be tricky to determine what is actually being compared. I think it's premature to decide that the comparison is "Python" vs. "R" without a lot of work to verify that all the libraries you use and functions you write are reasonably optimized for each language.
One of the strengths of python is a pretty good set of line profiling tools. I like
I used line profiling to run a (Python 2 version of) your code, and I found that one single line in your script was responsible for 94.5% of the execution time. Can you guess what it is? It was this one:
I'm not a veteran of pandas but I think it's safe to say that building a data frame row-by-row in pandas is not very efficient.
How did I refactor your code to run in Python 2 and ran
The bottom-line results: your code took about 25.6 seconds (on my machine) to run, but simply filling the Pandas dataframe by column after parsing was done instead of row-by-row during parsing took 1.2 seconds. This simple modification led to speedup of more than 20×! You could probably get even faster by pre-allocating Numpy structured arrays for each column, and then using those to fill a dataframe by column.
In addition to the timing issue, there are a number of other aspects of your code that you may want to consider revising:
-
Variable names are confusing (what is
-
For simple timing, consider the
-
You should factor your code into smaller functions that each accomplish a task. For example, one function could be generating a list of filenames to parse. Another function could parse the data and return a dataframe, and a third could find the frequencies.
In the end, it's tough to know what your original speed comparison means. If your original Python implementation of this script is based off of a direct translation from R, then it probably means that Pandas sucks at filling dataframes by row. But even if that is true, its unclear if Pandas is "slower" than R, because being aware of the row-by-row limitation, you should be able to easily work around it in almost every forseeable use case. (Can anyone think of an example where filling a dataframe row by row is essential and it can't be done any other way?)
Thanks for asking a fun question!
Here's all the code I used.
```
def run_code():
start_time = datetime.now()
# Create empty dataframe with correct column names
column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), subdir))
# "Cache" regex
# See http://stackoverflow.com/q/452104/1150683
p_filename = re.compile(r"[./\\]")
p_sentence = re.compile(r"(.*?)")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
p_last_word = re.compile(r"^.\b(?<!-)(\w+(?:-\w+))[^\w]*$", re.U)
# Loop files in folder
filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]
for filename in filenames:
with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
[n, c] = p_filename.split(filename.lower())[-3:-1]
fn = ".".join([n, c])
for line in f:
One of the strengths of python is a pretty good set of line profiling tools. I like
line_profiler, because it works in IPython notebooks, a format I find convenient for "real time" coding and tooling around. I used line profiling to run a (Python 2 version of) your code, and I found that one single line in your script was responsible for 94.5% of the execution time. Can you guess what it is? It was this one:
df = df.append([dict(fileName=fn, component=c,
precedingWord=pw, node=n,
leftContext=lc, sentence=s)])I'm not a veteran of pandas but I think it's safe to say that building a data frame row-by-row in pandas is not very efficient.
How did I refactor your code to run in Python 2 and ran
line_profiler on it? Essentially I just wrapped everything you wrote into a dummy function called run_code() and called the line profiler on that function. For Python 2: - I had to modify how you were parsing your directory (I use
os.listdir()instead ofgrobstuff)
- I had to
import codecbecause Python 2 isn't natively able to Unicode the same way that Python 3 is.
The bottom-line results: your code took about 25.6 seconds (on my machine) to run, but simply filling the Pandas dataframe by column after parsing was done instead of row-by-row during parsing took 1.2 seconds. This simple modification led to speedup of more than 20×! You could probably get even faster by pre-allocating Numpy structured arrays for each column, and then using those to fill a dataframe by column.
In addition to the timing issue, there are a number of other aspects of your code that you may want to consider revising:
-
Variable names are confusing (what is
s, etc.) and don't always follow PEP8. (Avoid camelCase and use snake_case instead, etc.)-
For simple timing, consider the
timeit module instead of datetime. (Of course, the main point of my answer is that line profiling is essential to figure out where the slow parts of your code are, be sure to use more than simple timing commands when you are optimizing...but there are certainly times where simple timing is useful, and timeit is a module engineered for that task.-
You should factor your code into smaller functions that each accomplish a task. For example, one function could be generating a list of filenames to parse. Another function could parse the data and return a dataframe, and a third could find the frequencies.
In the end, it's tough to know what your original speed comparison means. If your original Python implementation of this script is based off of a direct translation from R, then it probably means that Pandas sucks at filling dataframes by row. But even if that is true, its unclear if Pandas is "slower" than R, because being aware of the row-by-row limitation, you should be able to easily work around it in almost every forseeable use case. (Can anyone think of an example where filling a dataframe row by row is essential and it can't be done any other way?)
Thanks for asking a fun question!
Here's all the code I used.
import os
import re
from datetime import datetime
import numpy as np
import pandas as pd
from glob import glob
# unicode file parsing support in Python 2.x
import codecs
# get unescape to work in Python 2.x
import HTMLParser
unescape = HTMLParser.HTMLParser().unescape
# line profiling
%load_ext line_profiler
import timeit```
def run_code():
start_time = datetime.now()
# Create empty dataframe with correct column names
column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), subdir))
# "Cache" regex
# See http://stackoverflow.com/q/452104/1150683
p_filename = re.compile(r"[./\\]")
p_sentence = re.compile(r"(.*?)")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
p_last_word = re.compile(r"^.\b(?<!-)(\w+(?:-\w+))[^\w]*$", re.U)
# Loop files in folder
filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]
for filename in filenames:
with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
[n, c] = p_filename.split(filename.lower())[-3:-1]
fn = ".".join([n, c])
for line in f:
Code Snippets
df = df.append([dict(fileName=fn, component=c,
precedingWord=pw, node=n,
leftContext=lc, sentence=s)])import os
import re
from datetime import datetime
import numpy as np
import pandas as pd
from glob import glob
# unicode file parsing support in Python 2.x
import codecs
# get unescape to work in Python 2.x
import HTMLParser
unescape = HTMLParser.HTMLParser().unescape
# line profiling
%load_ext line_profiler
import timeitdef run_code():
start_time = datetime.now()
# Create empty dataframe with correct column names
column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), subdir))
# "Cache" regex
# See http://stackoverflow.com/q/452104/1150683
p_filename = re.compile(r"[./\\]")
p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
# Loop files in folder
filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]
for filename in filenames:
with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
[n, c] = p_filename.split(filename.lower())[-3:-1]
fn = ".".join([n, c])
for line in f:
uline = unicode(line)
s = p_sentence.search(unescape(uline)).group(1)
s = s.lower()
s = p_typography.sub("", s)
s = p_non_graph.sub("", s)
s = p_quote.sub("'", s)
s = p_ellipsis.sub("... ", s)
if n in re.split(r"[ :?.,]", s):
lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
pw = p_last_word.sub("\\1", lc)
df = df.append([dict(fileName=fn, component=c,
precedingWord=pw, node=n,
leftContext=lc, sentence=s)])
continue
# Reset indices
df.reset_index(drop=True, inplace=True)
# Export dataset
df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
# Let's make a frequency list
# Create new dataframe
# Define neuter and non_neuter
neuter = ["het"]
non_neuter = ["de"]
# Create crosstab
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
freqDf = pd.crosstab(df.node, df.gender)
freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
# How long has the script been running?
time_difference = datetime.now() - start_time
print("Time difference of", time_difference)
return%lprun -f run_code run_code()Timer unit: 1e-06 s
Total time: 25.6168 s
File: <ipython-input-5-b8823da4f6a5>
Function: run_code at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def run_code():
2
3 1 10 10.0 0.0 start_time = datetime.now()
4
5 # Create empty dataframe with correct column names
6 1 2 2.0 0.0 column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
7 1 384 384.0 0.0 df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
8
9 # Create correct path where to fetch files
10 1 2 2.0 0.0 subdir = "rawdata"
11 1 119 119.0 0.0 path = os.path.abspath(os.path.join(os.getcwd(), subdir))
12
13 # "Cache" regex
14 # See http://stackoverflow.com/q/452104/1150683
15 1 265 265.0 0.0 p_filename = re.compile(r"[./\\]")
16
17 1 628 628.0 0.0 p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
18 1 697 697.0 0.0 p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
19 1 411 411.0 0.0 p_non_graph = re.compile(r"[^\x21-\x7E\s]")
20 1 128 128.0 0.0 p_quote = re.compile(r"\"")
21 1 339 339.0 0.0 p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
22
23 1 1048 1048.0 0.0 p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
24
25 # Loop files in folder
26 108 1122 10.4 0.0 filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]
27
28 108 250 2.3 0.0 for filename in filenames:
29 107 5341 49.9 0.0 with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
30 107 867 8.1 0.0 [n, c] = p_filename.split(filename.lower())[-3:-1]
31 107 277 2.6 0.0 fn = ".".join([n, c])
32 6607 395024 59.8 1.5 for line iContext
StackExchange Code Review Q#101648, answer score: 7
Revisions (0)
No revisions yet.