Mean of many subsets of a dataframe

Submitted by: @import:stackexchange-codereview
dataframe, mean, subsets, many

Problem

I have a large data frame containing many replicates.
The replicates come in groups of 3: the first set is in columns 1, 2 and 3, the second in columns 4, 5 and 6, and so on.

Now I create a new data frame containing the mean of each set of replicates.

The code below works, but it is really clunky; the cbind and the column-name setting are especially ugly.

# first I create the new data frame
data.mean <- data.frame(matrix(nrow = 30))

# iterate over every third column
for (col in seq(1, length(colnames(data)), by = 3)) {

    # create a subset of the data frame, compute the row means, and cbind them to the result data frame
    data.mean <- cbind(data.mean, apply(subset(data, select = seq(col, length.out = 3)), 1, mean, na.rm = TRUE))

    # set the new column name to the column name from the old data set (name of the first replicate)
    colnames(data.mean)[ncol(data.mean)] <- colnames(data)[col]
}


I really want to improve my R coding style, so I am happy about every tip!

Solution

Here is a proposal for a different approach that doesn't use a for loop and has some simplifications.

First, an example data frame:

dat <- data.frame(a1 = 9:11, a2 = 2:4, a3 = 3:5,
                  b1 = 4:6, b2 = 5:7, b3 = 1:3)
#   a1 a2 a3 b1 b2 b3
# 1  9  2  3  4  5  1
# 2 10  3  4  5  6  2
# 3 11  4  5  6  7  3


Now, we set the number of columns per group:

# number of columns per group (1-3, 4-6)
n <- 3


From this, the number of groups and the column indices of each group can be calculated:

# number of groups
n_grp <- ncol(dat) / n
# 2

# column indices (one vector per group)
idx_grp <- split(seq(dat), rep(seq(n_grp), each = n))
# $`1`
# [1] 1 2 3
#
# $`2`
# [1] 4 5 6
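
To make the index construction easier to follow, here is a small sketch that builds the group labels and the split separately. The names `n_cols`, `grp_label` and `idx_demo` are made up for this illustration, and the divisibility check is an added assumption, not part of the original answer.

```r
n <- 3
n_cols <- 6  # assumed column count, matching the example data frame

# Guard (an addition for illustration): the columns must divide evenly into groups
stopifnot(n_cols %% n == 0)

# One group label per column position
grp_label <- rep(seq_len(n_cols / n), each = n)
grp_label
# [1] 1 1 1 2 2 2

# split() collects the column positions that share a label
idx_demo <- split(seq_len(n_cols), grp_label)
idx_demo$`1`
# [1] 1 2 3
idx_demo$`2`
# [1] 4 5 6
```

Each list element holds the column positions of one group, which is what the lapply step needs.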


In the next step, lapply is used to calculate the row means of each group. The rowMeans function makes this much more convenient than apply with mean.

# calculate the row means for all groups
res <- lapply(idx_grp, function(i) {
    # subset of the data frame
    tmp <- dat[i]
    # calculate row means
    rowMeans(tmp, na.rm = TRUE)
})
# $`1`
# [1] 4.666667 5.666667 6.666667
#
# $`2`
# [1] 3.333333 4.333333 5.333333
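
As a quick check that rowMeans can stand in for the apply call from the question, the sketch below compares both on one group of columns. The missing value is a made-up addition to exercise na.rm, and `via_apply` and `via_row_means` are illustrative names.

```r
# One group of replicates; the NA is a hypothetical missing measurement
grp <- data.frame(a1 = 9:11, a2 = 2:4, a3 = 3:5)
grp$a2[1] <- NA

# Approach from the question: apply mean() over the rows
via_apply <- apply(grp, 1, mean, na.rm = TRUE)

# Approach from the answer: rowMeans() does the same in a single call
via_row_means <- rowMeans(grp, na.rm = TRUE)

# Both give the same values; the first row averages only 9 and 3
all.equal(unname(via_apply), unname(via_row_means))
# [1] TRUE
```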


The command above returns a list. It can be transformed into a data frame:

# transform list into a data frame
dat2 <- as.data.frame(res)
#         X1       X2
# 1 4.666667 3.333333
# 2 5.666667 4.333333
# 3 6.666667 5.333333


In order to set the column names of the new data frame, we first have to extract the column names of the groups' first columns.

# extract names of first column of each group
names_frst <- names(dat)[sapply(idx_grp, "[", 1)]
# [1] "a1" "b1"


Now, these names are used for the new data frame:

# modify column names of new data frame
names(dat2) <- names_frst
#         a1       b1
# 1 4.666667 3.333333
# 2 5.666667 4.333333
# 3 6.666667 5.333333
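
A possible variation, not part of the original answer: if the index list is named before the means are computed, lapply and as.data.frame carry the names through and the renaming step at the end is not needed. This is a minimal sketch that reuses the example data; `seq_along` and `seq_len` are swapped in for `seq`.

```r
dat <- data.frame(a1 = 9:11, a2 = 2:4, a3 = 3:5,
                  b1 = 4:6, b2 = 5:7, b3 = 1:3)
n <- 3

# column indices, one vector per group
idx_grp <- split(seq_along(dat), rep(seq_len(ncol(dat) / n), each = n))

# label each group with the name of its first column up front
names(idx_grp) <- names(dat)[sapply(idx_grp, "[", 1)]

# the list names become the column names of the result
dat2 <- as.data.frame(lapply(idx_grp, function(i) rowMeans(dat[i], na.rm = TRUE)))
names(dat2)
# [1] "a1" "b1"
```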


Done.
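
The steps above can also be collected into a single function. This is only a sketch under stated assumptions: the name `group_row_means` is hypothetical, and the check that the column count is a multiple of `n` is an addition that the original answer does not include.

```r
# Hypothetical helper: row means of every block of `n` adjacent columns
group_row_means <- function(dat, n = 3) {
  # assumption: the columns divide evenly into groups of size n
  stopifnot(ncol(dat) %% n == 0)

  # column indices, one vector per group
  idx_grp <- split(seq_along(dat), rep(seq_len(ncol(dat) / n), each = n))

  # row means of each group
  res <- lapply(idx_grp, function(i) rowMeans(dat[i], na.rm = TRUE))

  # combine into a data frame named after each group's first column
  out <- as.data.frame(res)
  names(out) <- names(dat)[vapply(idx_grp, `[`, integer(1), 1)]
  out
}

dat <- data.frame(a1 = 9:11, a2 = 2:4, a3 = 3:5,
                  b1 = 4:6, b2 = 5:7, b3 = 1:3)
out <- group_row_means(dat, n = 3)
names(out)
# [1] "a1" "b1"
```

For the data frame in the question, the call would be `group_row_means(data, n = 3)`.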

Code Snippets

dat <- data.frame(a1 = 9:11, a2 = 2:4, a3 = 3:5,
                  b1 = 4:6, b2 = 5:7, b3 = 1:3)
#   a1 a2 a3 b1 b2 b3
# 1  9  2  3  4  5  1
# 2 10  3  4  5  6  2
# 3 11  4  5  6  7  3
# number of columns per group (1-3, 4-6)
n <- 3
# number of groups
n_grp <- ncol(dat) / n
# 2

# column indices (one vector per group)
idx_grp <- split(seq(dat), rep(seq(n_grp), each = n))
# $`1`
# [1] 1 2 3
#
# $`2`
# [1] 4 5 6
# calculate the row means for all groups
res <- lapply(idx_grp, function(i) {
    # subset of the data frame
    tmp <- dat[i]
    # calculate row means
    rowMeans(tmp, na.rm = TRUE)
})
# $`1`
# [1] 4.666667 5.666667 6.666667
#
# $`2`
# [1] 3.333333 4.333333 5.333333
# transform list into a data frame
dat2 <- as.data.frame(res)
#         X1       X2
# 1 4.666667 3.333333
# 2 5.666667 4.333333
# 3 6.666667 5.333333

Context

StackExchange Code Review Q#58523, answer score: 7
