
Read one million records per second

Submitted by: @import:stackexchange-codereview

Problem

The following code ingests 10k-20k records per second, and I want to improve its performance. I'm reading JSON and ingesting it into the database through Kafka. I'm running it on a five-node cluster with ZooKeeper and Kafka installed.

Can you give me some tips to improve it?

import os
import json
from multiprocessing import Pool
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer

def process_line(line):
    # A new producer is created for every line handed to a worker.
    producer = SimpleProducer(client)
    try:
        jrec = json.loads(line.strip())
        # send_messages expects bytes payloads, so encode the re-serialized record
        producer.send_messages('twitter2613', json.dumps(jrec).encode('utf-8'))
    except ValueError:
        # skip lines that are not valid JSON
        pass

if __name__ == "__main__":
    client = KafkaClient('10.62.84.35:9092')
    myloop = True
    pool = Pool(30)

    direcToData = os.listdir("/FullData/RowData")
    for loop in direcToData:
        mydir2 = os.listdir("/FullData/RowData/" + loop)

        for i in mydir2:
            if myloop:
                with open("/FullData/RowData/" + loop + "/" + i) as source_file:
                    # chunk the work into batches of 30 lines at a time
                    results = pool.map(process_line, source_file, 30)

Solution

Don't split reading from the same file across different processes

I haven't used Pool myself, but it seems like you're splitting the file read over 30 different processes, each taking 30 lines at a time? If that is correct, you should seriously consider a different split tactic, as that will throttle your IO badly: you'll have 30 different processes trying to read from 30 different places in the file at the same time.

A better tactic would be to send each file to a different process, and then let that process handle that file completely.
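A minimal sketch of that per-file split, under some assumptions: `publish` is a hypothetical stand-in for the Kafka send call, and the names `list_files`, `process_file` and `ingest` are illustrative, not from the original code. The point is only that the pool receives file paths, so each file is read from start to finish by one process.

```python
import json
import os
from multiprocessing import Pool


def publish(payload):
    """Hypothetical stand-in for the Kafka send call; swap in a real producer."""
    return len(payload)


def list_files(root):
    """Return every regular file under root, sorted for a stable order."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def process_file(path):
    """Read one file sequentially in a single process; return (sent, skipped)."""
    sent = skipped = 0
    with open(path) as source_file:
        for line in source_file:
            try:
                record = json.loads(line)
            except ValueError:
                # not valid JSON (including blank lines): count it and move on
                skipped += 1
                continue
            publish(json.dumps(record).encode("utf-8"))
            sent += 1
    return sent, skipped


def ingest(root, workers):
    """Hand whole files, not individual lines, to the worker pool."""
    with Pool(workers) as pool:
        return pool.map(process_file, list_files(root))
```

Something like `ingest("/FullData/RowData", workers)` would be called under the `if __name__ == "__main__":` guard. With this layout a worker could also create its Kafka producer once per file (or once per process) instead of once per line.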

Choose number of processes wisely

Another caveat is the number of processes you use. You create 30 processes, but unless you have 30 actual processors available, you won't see any major performance gain from using that many.

Back in the day, when we ran compilations in batches on various Unix-based operating systems, the general rule was to aim for approximately 4 times the number of processors available. In other words, on a quad-processor machine we would aim for 16 processes. Any more and we started seeing congestion due to interprocess issues and IO-related performance bottlenecks.
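As a rough illustration of that rule of thumb, the pool size could be derived from the machine instead of being hard-coded. The helper name `worker_count` and its parameters are assumptions made for this sketch, and the factor of 4 is only the starting point mentioned above, to be tuned by measurement.

```python
import os


def worker_count(per_cpu=4, cpus=None):
    """Suggest a pool size of per_cpu processes for each available processor."""
    if cpus is None:
        # os.cpu_count() can return None when the count is undetermined
        cpus = os.cpu_count() or 1
    return max(1, per_cpu * cpus)
```

On a quad-processor machine, `worker_count()` gives 16, which would replace the fixed `Pool(30)`.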

Shorten the distance from file to server

Another speedup can be found if you are able to avoid network traffic, that is, if you can run this script directly on the Kafka server so that it uses loopback addresses and local connections instead of the IP network.

Establish a baseline

Not so much a performance suggestion, but do you have a good baseline for how long a typical run takes using only a single process? It is helpful when you start dividing the load according to other metrics, because it gives you something to compare against when deciding what each server/client should do.
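One way to get such a baseline is a small single-process timing harness. This is a sketch only: `measure_throughput` and its `handle` callback are hypothetical names, and `handle` would be whatever does the per-line work (parsing, sending to Kafka, or both).

```python
import time


def measure_throughput(lines, handle):
    """Run handle on every line in one process; return (count, seconds, per_second)."""
    start = time.perf_counter()
    count = 0
    for line in lines:
        handle(line)
        count += 1
    elapsed = time.perf_counter() - start
    # guard against a zero interval on very small inputs
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate
```

Running it once with a parse-only `handle` and once with parse plus send would show how much of the time goes into JSON handling versus the Kafka round trip, before any processes are added.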

Context

StackExchange Code Review Q#113124, answer score: 13
