Transpose a large matrix in Python3
Tags: transpose, matrix, large, python3
Problem
I want to transpose this matrix input:
1.000 2.00 3.0 4.00
5.00 6.000 7.00000 8.0000000
9.0 10.0 11.0 12.00000
And get this output:
1.000 5.00 9.0
2.00 6.000 10.0
3.0 7.00000 11.0
4.00 8.0000000 12.00000
I have a matrix in a file with thousands of lines and millions of columns, so I can't read it into memory (i.e. numpy.transpose is not an option). I have written the solution below, which is very memory efficient, but terribly slow.
import sys
import os


def main():

    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in)+'.transposed'
    separator = ' '

    d_seek = {}
    with open(path_in) as fd_in:
        i = 0
        print('indexing')
        while True:
            tell = fd_in.tell()
            if fd_in.readline() == '':
                break
            d_seek[i] = tell
            i += 1
        print('indexed')
    cols2 = rows1 = i

    with open(path_in) as fd_in:
        line = fd_in.readline()
        rows2 = cols1 = len(line.split(separator))
    del line

    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        print('transposing')
        for row2 in range(rows2):
            print('row', row2)
            for row1 in range(rows1):
                fd_in.seek(d_seek[row1])
                s = ''
                while True:
                    char = fd_in.read(1)
                    if char == separator or char == '\n':
                        break
                    s += char
                d_seek[row1] += len(s)+1
                if row1+1 < rows1:
                    fd_out.write('{} '.format(s))
                else:
                    fd_out.write('{}\n'.format(s))

    return


if __name__ == '__main__':
    main()

How can I make it faster? The slow parts are seek and read.
Additional information:
The fields do not have fixed widths, but I know that a field is always an integer or a float and always between 1 and 5 characters and always belonging to the closed interval [0:2]. The fi
Solution
Fixed size value in binary file
This is not a problem if you have a binary file with fixed-size values.
In that case, all you need to do is copy your input file to the output (or allocate the same amount of space in any other way) and then call file.seek to jump to the position of the next value you need to write.
It may be faster to move the read cursor instead of the write cursor. In that case you do not need to preallocate a large file; you just append to the end of the output file.
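The write-cursor variant described above could look roughly like the following sketch. It assumes every value is stored as an 8-byte little-endian float; the `VALUE` format and the names `transpose_by_write_seek` and `demo_write_seek` are illustrative, not part of the original answer.

```python
import os
import struct
import tempfile

VALUE = struct.Struct("<d")  # assumption: one 8-byte little-endian float per value


def transpose_by_write_seek(src_path, dst_path, n_rows, n_cols):
    """Read the source sequentially and seek in the destination for every value."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Preallocate the output so that every target position already exists.
        dst.truncate(n_rows * n_cols * VALUE.size)
        for r in range(n_rows):
            for c in range(n_cols):
                # Element (r, c) of A becomes element (c, r) of A^T,
                # and each row of A^T holds n_rows values.
                dst.seek((c * n_rows + r) * VALUE.size)
                dst.write(src.read(VALUE.size))


def demo_write_seek():
    """Transpose a small 2 x 3 matrix on disk and return the result as lists."""
    rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "a.bin")
        dst_path = os.path.join(tmp, "a_t.bin")
        with open(src_path, "wb") as f:
            for row in rows:
                for value in row:
                    f.write(VALUE.pack(value))
        transpose_by_write_seek(src_path, dst_path, n_rows=2, n_cols=3)
        with open(dst_path, "rb") as f:
            data = f.read()
    flat = [item[0] for item in VALUE.iter_unpack(data)]
    return [flat[i:i + 2] for i in range(0, len(flat), 2)]
```

Here the input is read strictly front to back, and all the jumping happens on the output side. Whether this beats seeking on the input depends on the file system and buffering, so it is worth measuring both.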
Variable size value in text file
I have no idea how to implement that in a way that is both memory efficient and fast.
So if you can impose restrictions on the input data, make it a binary file. Otherwise, try to find other restrictions that might give you some advantage.
BTW: if it isn't critical, you can try converting the file from text to binary for transposing, and back to text afterwards.
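A minimal sketch of that text-to-binary round trip is shown below, assuming space-separated fields that all parse as floats and no blank lines inside the matrix. The helper names (`text_to_binary`, `binary_to_text`, `demo_round_trip`) and the 8-byte float format are assumptions made for illustration.

```python
import os
import struct
import tempfile

VALUE = struct.Struct("<d")  # assumption: store every field as an 8-byte float


def text_to_binary(text_path, bin_path, separator=" "):
    """Stream a text matrix into fixed-size binary values; return (rows, cols)."""
    n_rows = n_cols = 0
    with open(text_path) as src, open(bin_path, "wb") as dst:
        for line in src:
            if not line.strip():
                continue  # ignore a trailing blank line
            fields = line.rstrip("\n").split(separator)
            n_cols = len(fields)
            dst.write(b"".join(VALUE.pack(float(x)) for x in fields))
            n_rows += 1
    return n_rows, n_cols


def binary_to_text(bin_path, text_path, n_rows, n_cols, separator=" "):
    """Stream a binary matrix of n_rows x n_cols values back into text."""
    with open(bin_path, "rb") as src, open(text_path, "w") as dst:
        for _ in range(n_rows):
            row = src.read(n_cols * VALUE.size)
            values = (item[0] for item in VALUE.iter_unpack(row))
            dst.write(separator.join(repr(v) for v in values) + "\n")


def demo_round_trip():
    """Convert a tiny text matrix to binary and back again."""
    with tempfile.TemporaryDirectory() as tmp:
        txt_path = os.path.join(tmp, "m.txt")
        bin_path = os.path.join(tmp, "m.bin")
        out_path = os.path.join(tmp, "m_out.txt")
        with open(txt_path, "w") as f:
            f.write("1.000 2.00\n0.5 0\n")
        shape = text_to_binary(txt_path, bin_path)
        binary_to_text(bin_path, out_path, *shape)
        with open(out_path) as f:
            return shape, f.read()
```

Note that the round trip does not preserve the original formatting: a field written as 1.000 comes back as 1.0. The binary file also needs 8 bytes per value, which for thousands of rows and millions of columns means tens of gigabytes of scratch space.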
Explaining
Let's look at the following matrix and its transpose:
$$
A = \left[ \begin{array}{ll}
a_{11} & a_{12} \\
a_{21} & a_{22}
\end{array}\right] $$
$$
A^T = \left[ \begin{array}{ll}
a_{11} & a_{21} \\
a_{12} & a_{22}
\end{array}\right]
$$
Your source file contains the matrix \$A\$, while your destination file should contain \$A^T\$.
To write the data to the destination file sequentially, you have to jump around the source file to read each element of the current column.
# ROW_OFFSET: length of one row in the binary file, in bytes
# NUMBER_OF_ROWS: number of rows in the source file
# DATA_TYPE_SIZE: size of one value, in bytes
# current_column: index of the column being read (set by an outer loop over all columns)
for current_line in range(NUMBER_OF_ROWS):
    sourceFile.seek(current_column * DATA_TYPE_SIZE + ROW_OFFSET * current_line)
    destinationFile.write(sourceFile.read(DATA_TYPE_SIZE))
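For completeness, here is one way the loop above might be wrapped into a self-contained function, including the outer loop over columns. It is only a sketch: the function names, the 8-byte value size, and the little-endian float format used in the demo are assumptions, not part of the original answer.

```python
import os
import struct
import tempfile

DATA_TYPE_SIZE = 8  # assumption: every value is an 8-byte float


def transpose_by_read_seek(source_path, destination_path,
                           number_of_rows, number_of_columns):
    """Seek in the source for every value and append to the destination."""
    row_offset = number_of_columns * DATA_TYPE_SIZE  # bytes per source row
    with open(source_path, "rb") as source_file, \
            open(destination_path, "wb") as destination_file:
        for current_column in range(number_of_columns):
            for current_line in range(number_of_rows):
                source_file.seek(current_column * DATA_TYPE_SIZE
                                 + row_offset * current_line)
                destination_file.write(source_file.read(DATA_TYPE_SIZE))


def demo_read_seek():
    """Transpose a small 2 x 3 matrix on disk and return the flat result."""
    values = [1.0, 2.0, 3.0,
              4.0, 5.0, 6.0]
    with tempfile.TemporaryDirectory() as tmp:
        source_path = os.path.join(tmp, "a.bin")
        destination_path = os.path.join(tmp, "a_t.bin")
        with open(source_path, "wb") as f:
            f.write(struct.pack("<6d", *values))
        transpose_by_read_seek(source_path, destination_path, 2, 3)
        with open(destination_path, "rb") as f:
            return list(struct.unpack("<6d", f.read()))
```

The output is written strictly in order, so no preallocation is needed; the cost is one seek and one small read per value on the input side.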
Context
StackExchange Code Review Q#64370, answer score: 2