
Reading from a .txt file to a pandas dataframe

Submitted by: @import:stackexchange-codereview

Problem

I have a text file './inputs/dist.txt' with the following contents:

     1         1      2.92
     1         2     70.75
     1         3     60.90
     2         1     71.34
     2         2      5.23
     2         3     38.56
     3         1     61.24
     3         2     38.68
     3         3      4.49


I'm reading the text file to store it in a dataframe by doing:

from pandas import DataFrame
import pandas as pd
import os

def get_file_name( path):
    return os.path.basename(path).split(".")[0].strip().lower() 

name = get_file_name('./inputs/dist.txt')
with open('./inputs/dist.txt') as f:
    df = DataFrame(0.0, index=[1,2,3], columns=[1,2,3])
    for line in f:
        data = line.strip().split()
        row,column,value = [int(i) if i.isdigit() else float(i) for i in data]
        df.set_value(row,column,value)
m[name] = df


and I end up with a dataframe of the data. I have to read much bigger files that follow this format. Is there a faster way to do this and improve the runtime?

Solution

When opening very large files, the first concern is whether the data fits in the memory available on your system; if it does not, the system starts swapping to a slower device (i.e. disk).

Pandas ships with built-in reader functions. For example, pandas.read_table is a good way to read a tabular data file, and it can also read the file in chunks.

In the specific case:

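import pandas

# sep=r'\s+' splits on any run of whitespace; it replaces delim_whitespace=True,
# which is deprecated in recent pandas releases.
df = pandas.read_table('./inputs/dist.txt', sep=r'\s+', names=('A', 'B', 'C'))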


This creates a DataFrame with three columns: A and B of dtype int64, and C of dtype float64.
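The frame produced this way is in long format (one row per entry), whereas the loop in the question builds a row-by-column matrix. A minimal sketch of how the same layout might be recovered with DataFrame.pivot follows; the inline SAMPLE string and the names long_df and matrix are assumptions made for illustration, and a real run would pass the file path instead of a StringIO object.

```python
import io

import pandas as pd

# Sample data copied from the question so the sketch is self-contained.
SAMPLE = """\
     1         1      2.92
     1         2     70.75
     1         3     60.90
     2         1     71.34
     2         2      5.23
     2         3     38.56
     3         1     61.24
     3         2     38.68
     3         3      4.49
"""

# Long format: one line of the file becomes one row with columns A, B, C.
long_df = pd.read_table(io.StringIO(SAMPLE), sep=r"\s+", names=("A", "B", "C"))

# Wide format: A becomes the row index, B the column labels, C the cell values.
matrix = long_df.pivot(index="A", columns="B", values="C")

print(matrix)
```

After the pivot, matrix.loc[row, column] addresses a single value in the same way the cells set in the original loop would be addressed.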

You can also force the column types by passing the dtype argument to read_table, for example to make the second column float64:

import numpy as np
import pandas

# Same call as before, with an explicit dtype for each named column.
df = pandas.read_table('./inputs/dist.txt', sep=r'\s+', names=('A', 'B', 'C'),
                       dtype={'A': np.int64, 'B': np.float64, 'C': np.float64})
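For files that do not fit comfortably in memory, the chunked reading mentioned earlier could look roughly like the sketch below. Passing chunksize makes read_table return an iterator of DataFrames instead of a single frame. The helper name summarize_in_chunks, the chunk size of 4 and the inline SAMPLE data are assumptions for illustration only; what to do with each chunk depends on the actual task.

```python
import io

import pandas as pd

# Sample data copied from the question so the sketch is self-contained.
SAMPLE = """\
     1         1      2.92
     1         2     70.75
     1         3     60.90
     2         1     71.34
     2         2      5.23
     2         3     38.56
     3         1     61.24
     3         2     38.68
     3         3      4.49
"""


def summarize_in_chunks(source, chunksize):
    """Hypothetical helper: count rows and chunks without loading the whole file."""
    reader = pd.read_table(
        source,
        sep=r"\s+",
        names=("A", "B", "C"),
        chunksize=chunksize,  # each iteration yields a DataFrame of at most this many rows
    )
    n_chunks = 0
    n_rows = 0
    for chunk in reader:
        n_chunks += 1
        n_rows += len(chunk)
    return n_chunks, n_rows


n_chunks, n_rows = summarize_in_chunks(io.StringIO(SAMPLE), chunksize=4)
print(n_chunks, n_rows)
```

Only one chunk is held in memory at a time, so peak memory use is bounded by the chunk size rather than the file size.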

Context

StackExchange Code Review Q#152194, answer score: 7
