Shuffle DataFrame rows
Problem
I have the following DataFrame:

        Col1  Col2  Col3  Type
    0      1     2     3     1
    1      4     5     6     1
    ...
    20     7     8     9     2
    21    10    11    12     2
    ...
    45    13    14    15     3
    46    16    17    18     3
    ...

The DataFrame is read from a CSV file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows so that all Types are mixed. A possible result could be:

        Col1  Col2  Col3  Type
    0      7     8     9     2
    1     13    14    15     3
    ...
    20     1     2     3     1
    21    10    11    12     2
    ...
    45     4     5     6     1
    46    16    17    18     3
    ...

How can I achieve this?
Solution
The idiomatic way to do this with Pandas is to use the .sample method of your data frame to sample all rows without replacement:

    df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means to return all rows (in random order).
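As a minimal sketch of the approach above, the following builds a small DataFrame whose rows are grouped by Type and shuffles it. The data values are made up to mirror the question. The random_state argument is not part of the answer; it is an optional parameter of DataFrame.sample, assumed here only so that the shuffle is repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data: rows grouped by Type.
df = pd.DataFrame(
    {
        "Col1": [1, 4, 7, 10, 13, 16],
        "Col2": [2, 5, 8, 11, 14, 17],
        "Col3": [3, 6, 9, 12, 15, 18],
        "Type": [1, 1, 2, 2, 3, 3],
    }
)

# frac=1 draws every row exactly once (no replacement), i.e. a permutation.
# random_state is optional; fixing it makes the shuffle reproducible.
shuffled = df.sample(frac=1, random_state=0)

# The original index labels travel with their rows, so they appear out of order.
print(shuffled)

# No rows are lost or duplicated: sorting by index recovers the original frame.
restored = shuffled.sort_index()
print(restored.equals(df))  # True
```

Omitting random_state gives a different order on each run, which is usually what you want outside of tests.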
If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

    df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.

Follow-up note: Strictly speaking, the above operation is not in-place. The name df is rebound to a new object, so id(df_old) is not the same as id(df_new). Even so, it does not leave a second full-size copy behind, because the original object can be freed once df no longer refers to it. A simple memory profiler run shows almost no net increase in memory after the shuffle. The figures are taken after each line has finished, so they do not show any temporary peak while the line runs:

    $ python3 -m memory_profiler .\test.py
    Filename: .\test.py

    Line #    Mem usage    Increment   Line Contents
    ================================================
         5     68.5 MiB     68.5 MiB   @profile
         6                             def shuffle():
         7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
         8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
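To illustrate the reset_index step described in the note above, here is a small sketch comparing the default behaviour with drop=True, and showing that the assignment rebinds df to a new object. The three-row frame and the random_state value are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
df = pd.DataFrame({"Col1": [1, 7, 13], "Type": [1, 2, 3]})
original = df  # keep a second reference so the old object can be compared

shuffled = df.sample(frac=1, random_state=1)

# Default reset_index keeps the old labels in a new column named "index".
kept = shuffled.reset_index()

# drop=True discards the old labels and leaves only a fresh 0..n-1 index.
dropped = shuffled.reset_index(drop=True)

print(list(kept.columns))     # ['index', 'Col1', 'Type']
print(list(dropped.columns))  # ['Col1', 'Type']
print(list(dropped.index))    # [0, 1, 2]

# Rebinding the name: df now refers to a new object, not the original one.
df = df.sample(frac=1, random_state=1).reset_index(drop=True)
print(df is original)         # False
```

Without the extra reference named original, the old frame would become unreachable after the last assignment and its memory could be reclaimed.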
Context
Stack Overflow Q#29576430, score: 1619