Retrieving the first occurrence of every unique value from a CSV column
Problem
A large .csv file I was given contains a table of flight data. A function I wrote to help parse it iterates over the column of Flight IDs, and then returns a dictionary containing the index and value of every unique Flight ID in order of first appearance.
Dictionary = { Index: FID, ... }

This comes as a quick adjustment to an older function that didn't require having to worry about FID repeats in the column (a few hundred thousand rows later...).

Example:
20110117559515, ...
20110117559515, ...
20110117559515, ...
20110117559572, ...
20110117559572, ...
20110117559572, ...
20110117559574, ...
20110117559587, ...
20110117559588, ...

and so on, for some 5.3 million rows.
Right now, I have it iterating over and comparing each value in order. If a unique value appears, it only stores the first occurrence in the dictionary. I changed it to now also check if that value has already occurred before, and if so, to skip it.
def DiscoverEarliestIndex(self, number):
    result = {}
    columnvalues = self.column(number)
    column_enum = {}
    for a, b in enumerate(columnvalues):
        column_enum[a] = b
    i = 0
    while i < (len(columnvalues) - 1):
        next = column_enum[i + 1]
        if columnvalues[i] == next:
            i += 1
        else:
            if next in result.values():
                i += 1
                continue
            else:
                result[i + 1] = next
                i += 1
    else:
        return result

It's very inefficient, and slows down as the dictionary grows. The column has 5.2 million rows, so it's obviously not a good idea to handle this much with Python, but I'm stuck with it for now.
Is there a more efficient way to write this function?
Solution
You've given us a small piece of your problem to review, and Janne's suggestion seems reasonable if that piece is considered on its own. But I have the feeling that this isn't the only bit of analysis that you are doing on your data, and if so, you probably want to think about using a proper database.
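Taking that small piece on its own first: the cost comes from `next in result.values()`, which scans the whole dictionary on every row, making the loop quadratic overall. A minimal standalone sketch of the usual linear-time rewrite (a hypothetical free function, assuming the column is any iterable of flight IDs — not necessarily what Janne proposed):

```python
def earliest_indexes(values):
    """Return {index: value} for the first occurrence of each unique value.

    Runs in O(n) overall: membership is tested against a set, which is
    O(1) on average, instead of scanning result.values() on every row.
    """
    result = {}
    seen = set()
    for i, value in enumerate(values):
        if value not in seen:
            seen.add(value)
            result[i] = value
    return result

# Example with the sample FIDs from the question:
fids = ['20110117559515', '20110117559515', '20110117559515',
        '20110117559572', '20110117559572', '20110117559572',
        '20110117559574', '20110117559587', '20110117559588']
print(earliest_indexes(fids))
# {0: '20110117559515', 3: '20110117559572', 6: '20110117559574',
#  7: '20110117559587', 8: '20110117559588'}
```

This also handles the whole column in a single pass, with no separate `column_enum` copy and no index arithmetic.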
Python comes with a built-in relational database engine in the form of the sqlite3 module. So you could easily read your CSV directly into a SQLite table, either using the .import command in SQLite's command-line shell, or via Python if you need more preprocessing:

import sqlite3
import csv
def load_flight_csv(db_filename, csv_filename):
    """
    Load flight data from `csv_filename` into the SQLite database in
    `db_filename`.
    """
    with sqlite3.connect(db_filename) as conn, open(csv_filename, 'rb') as f:
        c = conn.cursor()
        c.execute('''CREATE TABLE IF NOT EXISTS flight
                     (id INTEGER PRIMARY KEY AUTOINCREMENT, fid TEXT)''')
        c.execute('''CREATE INDEX IF NOT EXISTS flight_fid ON flight (fid)''')
        c.executemany('''INSERT INTO flight (fid) VALUES (?)''', csv.reader(f))
        conn.commit()

(Obviously you'd need more fields in the CREATE TABLE statement, but since you didn't show them in your question, I can't guess what they might be.)

And then you can analyze the data by issuing SQL queries:
>>> db = 'flight.db'
>>> load_flight_csv(db, 'flight.csv')
>>> conn = sqlite3.connect(db)
>>> from pprint import pprint
>>> pprint(conn.execute('''SELECT MIN(id), fid FROM flight GROUP BY fid''').fetchall())
[(1, u'20110117559515'),
(4, u'20110117559572'),
(7, u'20110117559574'),
(8, u'20110117559587'),
(9, u'20110117559588')]
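If you still want the original `{index: FID}` dictionary shape, the query result converts directly. A self-contained sketch using an in-memory database and the sample FIDs from the question (note the AUTOINCREMENT ids are 1-based, so subtract 1 to recover 0-based row indexes):

```python
import sqlite3

# Sample FIDs from the question; in practice these rows would come from
# the flight table populated by load_flight_csv above.
fids = ['20110117559515', '20110117559515', '20110117559515',
        '20110117559572', '20110117559572', '20110117559572',
        '20110117559574', '20110117559587', '20110117559588']

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE flight '
             '(id INTEGER PRIMARY KEY AUTOINCREMENT, fid TEXT)')
conn.executemany('INSERT INTO flight (fid) VALUES (?)',
                 [(fid,) for fid in fids])

# MIN(id) per fid is the first occurrence; ids start at 1, so row_id - 1
# gives the 0-based index the original function used.
rows = conn.execute('SELECT MIN(id), fid FROM flight GROUP BY fid')
earliest = {row_id - 1: fid for row_id, fid in rows}
print(sorted(earliest.items()))
```

The index on fid makes the GROUP BY cheap even at 5 million rows, since SQLite can satisfy it by walking the index rather than sorting the table.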
Context
StackExchange Code Review Q#24126, answer score: 3