patternpythonModerate
Read one million records per second
Viewed 0 times
millionperreadrecordsonesecond
Problem
The following code ingests 10k-20k records per second and I want to improve the performance of it. I'm reading json and ingesting it into the database using Kafka. I'm running it on the cluster of five nodes with zookeeper and Kafka installed on it.
Can you give me some tips to improve?
Can you give me some tips to improve?
import os
import json
from multiprocessing import Pool
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer
def process_line(line):
producer = SimpleProducer(client)
try:
jrec = json.loads(line.strip())
producer.send_messages('twitter2613',json.dumps(jrec))
except ValueError, e:
{}
if __name__ == "__main__":
client = KafkaClient('10.62.84.35:9092')
myloop=True
pool = Pool(30)
direcToData = os.listdir("/FullData/RowData")
for loop in direcToData:
mydir2=os.listdir("/FullData/RowData/"+loop)
for i in mydir2:
if myloop:
with open("/FullData/RowData/"+loop+"/"+i) as source_file:
# chunk the work into batches of 4 lines at a time
results = pool.map(process_line, source_file, 30)Solution
Don't split reading from same file into different processes
I haven't used
A better tactic would be to send each file to a different process, and then let that process handle that file completely.
Choose number of processes wisely
Another caveat would be the number of processes you use. You create 30 processes, but as long as you don't have a 30 actual processors available you wont see any major performance gain using this number of processes.
Back in the days using various Unix based operating systems, we did compilations in batches and the general rule was that we would aim for approx 4 times the number of processors we had available. In other words on a quad-processor, we would aim for 16 processes. Any more and we started seeing congestion due to interprocess issues and IO related performance bottlenecks.
Shorten the distance from file to server
Another speedup can be found if you are able to avoid network traffic. That is if you are able to run this script directly on the
Establish a baseline
Not so much a performance suggestion, but do you have good baselines for how long it takes to do a typical run using only a single process? This can be helpful, when you start dividing the load according to other metrics, to compare when you reach a threshold regarding what each server/client should do.
I haven't used
Pool my self, but it seems like you're a splitting the file read over 30 different processes, where each reads 30 lines each? If that is correct, you should seriously consider a different split tactic, as that will throttle your IO seriously. You'll have 30 different processess trying to read from 30 different places in the file at the same time.A better tactic would be to send each file to a different process, and then let that process handle that file completely.
Choose number of processes wisely
Another caveat would be the number of processes you use. You create 30 processes, but as long as you don't have a 30 actual processors available you wont see any major performance gain using this number of processes.
Back in the days using various Unix based operating systems, we did compilations in batches and the general rule was that we would aim for approx 4 times the number of processors we had available. In other words on a quad-processor, we would aim for 16 processes. Any more and we started seeing congestion due to interprocess issues and IO related performance bottlenecks.
Shorten the distance from file to server
Another speedup can be found if you are able to avoid network traffic. That is if you are able to run this script directly on the
Kafka server, so that you can use loopback addresses and local connections instead of using the IP network.Establish a baseline
Not so much a performance suggestion, but do you have good baselines for how long it takes to do a typical run using only a single process? This can be helpful, when you start dividing the load according to other metrics, to compare when you reach a threshold regarding what each server/client should do.
Context
StackExchange Code Review Q#113124, answer score: 13
Revisions (0)
No revisions yet.