Data-pulling functions
Problem
The first function, `sqlPull()`, connects to a local MySQL database and pulls the last 20 rows every 5 seconds. The data comes back as a list of tuples, where each tuple holds a TimeStamp, MAC and RSSI. The data enters the database at a variable rate: sometimes I'll have tons of entries in a minute, and sometimes very few. Each entry consists of a MAC address, an RSSI and a TimeStamp, recorded every time a user's device pings our hardware, so unique MAC addresses accumulate changing RSSIs and TimeStamps over a period of time. Since I never know how many entries will be added within the 5 seconds the first function runs on, the second function, `dupCatch()`, is used to detect and eliminate duplicates. The third function, `post()`, just sends a POST request to Google Analytics with the data stripped of duplicates.

Here is how I would like to optimize my code: my next step is to calculate the length of time between a user's first and last entries in the database. We can assume that the maximum time between a user's entry and exit is 2 hours (in that time, the user would probably have 10-15 pings with different RSSIs and timestamps). I have been trying to think of the best way to do this with the current code, but everything I come up with seems unwieldy. In my mind, `sqlPull()` and/or `dupCatch()` could be rewritten/optimized to better serve a function that calculates the length of time between entry and exit.

How would you pull data that is coming in at a variable and unpredictable rate?
```
import mysql.connector
import datetime
import requests
import time

run = True

def sqlPull():
    connection = mysql.connector.connect(user='XXXXX', password='XXXXX',
                                         host='XXXXX', database='XXXXX')
    cursor = connection.cursor()
    cursor.execute("SELECT TimeStamp, MAC, RSSI FROM wifiscan "
                   "ORDER BY TimeStamp DESC LIMIT 20;")
    data = cursor.fetchall()   # list of (TimeStamp, MAC, RSSI) tuples
    cursor.close()
    connection.close()
    time.sleep(5)              # the pull runs every 5 seconds
    return data
```
Solution
You need to rate-limit the number of records you receive. You'll have to determine the best rate yourself, as that is specific to your scenario.
High Priority:
- Keep a list of "synced" entries in the database: something like an `IsSentToAnalytics` column.
- After you process a batch, update the database records you just selected with the synced flag set to yes, or 1, or true. The simplest way to do this is to update each entry one by one, but there are more efficient options (the first sketch after this list marks the whole batch in a single UPDATE).
- Your database query should then only select items that are not yet synced, which is lighter on the database (i.e. add the predicate `AND IsSentToAnalytics <> 1`).
- Your `dupCheck` is compensating for the missing flag: in a quiet period you pull already-processed data every time, which is why you need `dupCheck` at all, and in a busy period you can MISS data whenever more than 20 entries were added since your last run. That may be fine for a hobby project, but it is shoddy for a production deployment.
- Please, please add a surrogate primary key to your table. If nothing else, it keeps you from having to generate a hash every time you want to do a `dupCheck`.
- Whether that works depends on how you generate the data, but since every row has a timestamp, I'm assuming each record is unique anyway, so you might as well use the surrogate key as suggested. A surrogate key is just a sequence used as a primary key: every inserted entry gets a unique value.
- Edit: from your feedback below, you say you sometimes have duplicate values for the same timestamp. If that's the case, you need to filter them out before they are ever inserted: whatever job/service creates those entries should discard the duplicates. You can attempt the insert and catch the resulting exception, since the duplicate will fail the column constraint, or you can select first and only insert when no result comes back (see the second sketch after this list). You'll need to edit your question with more info for us to give you more detail on that.
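A minimal sketch of the flag-and-surrogate-key scheme from the points above. The `id` and `IsSentToAnalytics` names follow the suggestions; the exact column types and the shape of the batch UPDATE are assumptions:

```
import mysql.connector

connection = mysql.connector.connect(user='XXXXX', password='XXXXX',
                                     host='XXXXX', database='XXXXX')
cursor = connection.cursor()

# One-time schema change: a surrogate primary key plus the sync flag.
cursor.execute("ALTER TABLE wifiscan ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY")
cursor.execute("ALTER TABLE wifiscan ADD COLUMN IsSentToAnalytics TINYINT DEFAULT 0")

# Select only rows that have not been synced yet.
cursor.execute("SELECT id, TimeStamp, MAC, RSSI FROM wifiscan "
               "WHERE IsSentToAnalytics <> 1 LIMIT 20")
batch = cursor.fetchall()

# ... post() the batch to Google Analytics here ...

# Mark the whole batch as synced in one UPDATE instead of row by row.
ids = [row[0] for row in batch]
if ids:
    placeholders = ", ".join(["%s"] * len(ids))
    cursor.execute(f"UPDATE wifiscan SET IsSentToAnalytics = 1 "
                   f"WHERE id IN ({placeholders})", ids)
    connection.commit()
```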
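And for the duplicate-timestamp point, a sketch of the try-and-catch variant. It assumes a UNIQUE constraint exists on (MAC, TimeStamp); `IntegrityError` is the DB-API error mysql.connector raises when a key constraint fails:

```
import mysql.connector

def insertPing(cursor, timestamp, mac, rssi):
    """Insert one ping, silently discarding duplicates.

    Sketch only: relies on an assumed UNIQUE constraint on (MAC, TimeStamp),
    so the database rejects the duplicate and we just swallow the error.
    """
    try:
        cursor.execute("INSERT INTO wifiscan (TimeStamp, MAC, RSSI) "
                       "VALUES (%s, %s, %s)", (timestamp, mac, rssi))
    except mysql.connector.IntegrityError:
        pass  # duplicate (MAC, TimeStamp): discard it, as suggested above
```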
Low Priority:
- Move your sleep into your while loop, not into your SQL pull. There is no need to sleep before returning data; put the sleep where it belongs. Either that, or rename the function to `PullDataAndSleep`: if you think that name is fine, keep the code inside as it is and just rename, otherwise move the sleep.
- Reuse your connection. You are disconnecting and reconnecting to the database every 5 seconds, which is pointless unless you expect the connection to drop.
- Change your while loop to just `while run`. Writing `while run == True` is like saying "if True == True, then run"; it's redundant.
- Pull from the database at constant intervals (you are using 5 s; play around with it).
- Enforce an upper bound on the number of returned entries (you have 20).
- The last two points will help you rate-limit how many requests you send off to Google; if you spam them too quickly, they might block/ban your IP. A sketch of the resulting polling loop follows this list.
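Putting the low-priority points together, the polling loop could look something like this sketch. The 5-second interval and the asker's `sqlPull`/`dupCatch`/`post` functions come from the question; passing the shared connection into `sqlPull` is an assumption:

```
import time

import mysql.connector

POLL_INTERVAL = 5  # constant pull interval; tune it to your rate limit

# Connect once and reuse the connection instead of reconnecting every pull.
connection = mysql.connector.connect(user='XXXXX', password='XXXXX',
                                     host='XXXXX', database='XXXXX')

run = True
while run:                       # just "while run", not "while run == True"
    data = sqlPull(connection)   # the pull only pulls; no sleep hidden inside
    data = dupCatch(data)
    post(data)
    time.sleep(POLL_INTERVAL)    # the sleep lives in the loop, where it belongs
```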
Context
StackExchange Code Review Q#35566, answer score: 2