patternpythonMinor
Catalogue manipulation
Viewed 0 times
manipulationcataloguestackoverflow
Problem
I have a catalogue of 56000 objects and I want to replace some columns with new values for instance replace
and subtract it from some other columns and so on.
Using
The function is given by:
```
from astropy import units as u
from astropy.coordinates import SkyCoord
from scipy.ndimage.interpolation import map_coordinates
import os.path
from collections import Iterable
import pyfits as fits
def get_ebv_from_map(coordinates, interpolate=True, order=1):
mapdir='/vol/Dust_Map/maps/'
fname = os.path.join(mapdir, 'SFD_dust_4096_{0}.fits')
# Parse input
ra, dec = coordinates
coordinates = SkyCoord(ra=ra, dec=dec, unit=(u.degree, u.degree))
# Convert to galactic coordinates.
coordinates = coordinates.galactic
l = coordinates.l.radian
b = coordinates.b.radian
# Check if l, b are scalar
return_scalar = False
if not isinstance(l, Iterable):
return_scalar = True
l, b = np.array([l]), np.array([b])
# Initialize return array
ebv = np.empty_like(l)
# Treat north (b>0) separately from south (b= 0, 'ngp'), (-1, b < 0, 'sgp')]:
if not np.any(idx): continue
hdulist = fits.open(fname.format(ext))
mapd = hdulist[0].data
# Project from galactic longitude/latitude to lambert pixels.
# (See SFD98).
npix = mapd.shape[0]
x = (npix / 2 np.cos(l[idx]) np.sqrt(1. - n*np.sin(b[idx])) +
npix / 2 - 0.5)
y = (-npix / 2 n np.sin(l[idx]) np.sqrt(1. - nnp.sin(b[idx])) +
npix / 2 - 0.5)
# Get map values at these pixel coordinates.
if interpolate:
NaN with -99 and 0 for specific columns (as you can see in the following code). Then I need to compute a value for some columns with a function which is called get_ebv_from_mapand subtract it from some other columns and so on.
Using
get_ebv_from_map function to estimate the values in the catalogue is taking too much time and I think what I did like using this loop is not very efficient.The function is given by:
```
from astropy import units as u
from astropy.coordinates import SkyCoord
from scipy.ndimage.interpolation import map_coordinates
import os.path
from collections import Iterable
import pyfits as fits
def get_ebv_from_map(coordinates, interpolate=True, order=1):
mapdir='/vol/Dust_Map/maps/'
fname = os.path.join(mapdir, 'SFD_dust_4096_{0}.fits')
# Parse input
ra, dec = coordinates
coordinates = SkyCoord(ra=ra, dec=dec, unit=(u.degree, u.degree))
# Convert to galactic coordinates.
coordinates = coordinates.galactic
l = coordinates.l.radian
b = coordinates.b.radian
# Check if l, b are scalar
return_scalar = False
if not isinstance(l, Iterable):
return_scalar = True
l, b = np.array([l]), np.array([b])
# Initialize return array
ebv = np.empty_like(l)
# Treat north (b>0) separately from south (b= 0, 'ngp'), (-1, b < 0, 'sgp')]:
if not np.any(idx): continue
hdulist = fits.open(fname.format(ext))
mapd = hdulist[0].data
# Project from galactic longitude/latitude to lambert pixels.
# (See SFD98).
npix = mapd.shape[0]
x = (npix / 2 np.cos(l[idx]) np.sqrt(1. - n*np.sin(b[idx])) +
npix / 2 - 0.5)
y = (-npix / 2 n np.sin(l[idx]) np.sqrt(1. - nnp.sin(b[idx])) +
npix / 2 - 0.5)
# Get map values at these pixel coordinates.
if interpolate:
Solution
I'll echo the comments and say that a profile would really help you as far as diagnosing performance problems. Profile before trying any kind of performance optimizations, because whatever you think is going on, that's probably not it. (This applies to me as well, so take everything I say below about performance with a grain of salt.)
Your code is kind of hard to read. I would review PEP 8 for style advice. In particular, spaces around your operators would help a lot. You do this pretty consistently in
Your
I'm not too familiar with this library, so mock me as I deserve if my understanding of the code is egregiously wrong, but offhand, it looks to me like you're doing lots of disk IO; these lines:
look like they're opening and reading a file on every iteration of the loop and extracting one value, which is the only thing used in the rest of the loop body. You then call this function in another loop, so you have a loop in which you have a loop in which you open and read a file on every iteration. I'm guessing at least one of those loops does a substantial number of iterations, so that could be a huge performance drag, since disk access is slow. And if I'm reading your code correctly, you only have two unique file extensions,
To test this, I would try isolating that loop into its own function and profiling to see how its speed compares with the rest of the code. Something like:
and call that inside
If that does turn out to be a problem, I would try to pre-read these files somehow, or save the values you need somewhere and reuse them across invocations. If the items you're reading fit reasonably in main memory, I would try to cache them. Something like this:
I'm not familiar enough with what happens when things get bigger than main memory to offer any suggestions if your data surpasses the size of main memory.
Your code is kind of hard to read. I would review PEP 8 for style advice. In particular, spaces around your operators would help a lot. You do this pretty consistently in
get_ebv_from_map, but in the driving loop, everything's all scrunched up.Your
get_ebv_from_map function is really long and has all this different stuff going on. That could be masking performance problems, and it'll make profiling harder. If you split some of it into separate functions, the profiler will give you individual performance statistics for those functions, which could be a big help. I'm not too familiar with this library, so mock me as I deserve if my understanding of the code is egregiously wrong, but offhand, it looks to me like you're doing lots of disk IO; these lines:
for n, idx, ext in [(1, b >= 0, 'ngp'), (-1, b < 0, 'sgp')]:
if not np.any(idx): continue
hdulist = fits.open(fname.format(ext)) # this one
mapd = hdulist[0].datalook like they're opening and reading a file on every iteration of the loop and extracting one value, which is the only thing used in the rest of the loop body. You then call this function in another loop, so you have a loop in which you have a loop in which you open and read a file on every iteration. I'm guessing at least one of those loops does a substantial number of iterations, so that could be a huge performance drag, since disk access is slow. And if I'm reading your code correctly, you only have two unique file extensions,
'ngp' and 'sgp', and your fname doesn't seem to be changing, so you're opening and reading the same two files over and over again, sticking the first value (hdulist[0].data) into a variable, and throwing away that value when the iteration is done.To test this, I would try isolating that loop into its own function and profiling to see how its speed compares with the rest of the code. Something like:
def main_loop(a):
for n, idx, ext in [(1, b >= 0, 'ngp'), (-1, b < 0, 'sgp')]:
# Rest of the codeand call that inside
get_ebv_from_map where you now have the main loop. Profile, and compare the runtime of the main_loop function with the rest of the get_ebv_from_map function.If that does turn out to be a problem, I would try to pre-read these files somehow, or save the values you need somewhere and reuse them across invocations. If the items you're reading fit reasonably in main memory, I would try to cache them. Something like this:
for n, idx, ext in [(1, b >= 0, 'ngp'), (-1, b < 0, 'sgp')]:
# Other necessary code
if ext not in mapdcache:
with fits.open(fname.format(ext)) as hdulist:
mapdcache[ext] = hdulist[0].data
else:
mapd = mapdcache[ext]mapdcache is a dictionary or list declared at some outer scope (I wrote that code like it was a dictionary). If your files are fixed across calls of get_ebv_from_map (which seems to be the case since you have mapdir and fname as constants), you can put mapdcache outside of get_ebv_from_map and you'll just have to read each file once across the entire run of the program. Otherwise it can go inside, just above this loop, and you'll at least avoid repeated file IO within one call.I'm not familiar enough with what happens when things get bigger than main memory to offer any suggestions if your data surpasses the size of main memory.
Code Snippets
for n, idx, ext in [(1, b >= 0, 'ngp'), (-1, b < 0, 'sgp')]:
if not np.any(idx): continue
hdulist = fits.open(fname.format(ext)) # this one
mapd = hdulist[0].datadef main_loop(a):
for n, idx, ext in [(1, b >= 0, 'ngp'), (-1, b < 0, 'sgp')]:
# Rest of the codefor n, idx, ext in [(1, b >= 0, 'ngp'), (-1, b < 0, 'sgp')]:
# Other necessary code
if ext not in mapdcache:
with fits.open(fname.format(ext)) as hdulist:
mapdcache[ext] = hdulist[0].data
else:
mapd = mapdcache[ext]Context
StackExchange Code Review Q#63863, answer score: 2
Revisions (0)
No revisions yet.