patternpythonCritical
Import multiple CSV files into pandas and concatenate into one DataFrame
Viewed 0 times
pandasmultipleimportintocsvanddataframefilesconcatenateone
Problem
I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:
I guess I need some help within the for loop?
import glob
import pandas as pd
# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)I guess I need some help within the for loop?
Solution
See pandas: IO tools for all of the available
Try the following code if all of the CSV files have the same columns.
I have added
Or, with attribution to a comment from Sid.
Imports and Setup
Option 1:
Option 2:
Option 3:
Option 4:
or
.read_ methods.Try the following code if all of the CSV files have the same columns.
I have added
header=0, so that after reading the CSV file's first row, it can be assigned as the column names.import pandas as pd
import glob
import os
path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path , "/*.csv"))
li = []
for filename in all_files:
df = pd.read_csv(filename, index_col=None, header=0)
li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)Or, with attribution to a comment from Sid.
all_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
- It's often necessary to identify each sample of data, which can be accomplished by adding a new column to the dataframe.
pathlibfrom the standard library will be used for this example. It treats paths as objects with methods, instead of strings to be sliced.
Imports and Setup
from pathlib import Path
import pandas as pd
import numpy as np
path = r'C:\DRO\DCL_rawdata_files' # or unix / linux / mac path
# Get the files from the path provided in the OP
files = Path(path).glob('*.csv') # .rglob to get subdirectories
Option 1:
- Add a new column with the file name
dfs = list()
for f in files:
data = pd.read_csv(f)
# .stem is method for pathlib objects to get the filename w/o the extension
data['file'] = f.stem
dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
Option 2:
- Add a new column with a generic name using
enumerate
dfs = list()
for i, f in enumerate(files):
data = pd.read_csv(f)
data['file'] = f'File {i}'
dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
Option 3:
- Create the dataframes with a list comprehension, and then use
np.repeatto add a new column.
[f'S{i}' for i in range(len(dfs))]creates a list of strings to name each dataframe.
[len(df) for df in dfs]creates a list of lengths
- Attribution for this option goes to this plotting answer.
# Read the files into dataframes
dfs = [pd.read_csv(f) for f in files]
# Combine the list of dataframes
df = pd.concat(dfs, ignore_index=True)
# Add a new column
df['Source'] = np.repeat([f'S{i}' for i in range(len(dfs))], [len(df) for df in dfs])
Option 4:
- One liners using
.assignto create the new column, with attribution to a comment from C8H10N4O2
df = pd.concat((pd.read_csv(f).assign(filename=f.stem) for f in files), ignore_index=True)
or
df = pd.concat((pd.read_csv(f).assign(Source=f'S{i}') for i, f in enumerate(files)), ignore_index=True)
Code Snippets
import pandas as pd
import glob
import os
path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path , "/*.csv"))
li = []
for filename in all_files:
df = pd.read_csv(filename, index_col=None, header=0)
li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)Context
Stack Overflow Q#20906474, score: 905
Revisions (0)
No revisions yet.