patternpythonMinor

Calculating overlap of segments in chromosome data

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

codereview r bioinformatics stackoverflow performance python

chromosomeoverlapcalculatingsegmentsdata

Problem

I wrote an R code that basically performs 2 operations:

-
For each segment in file A, find all segments in file B that lie in that segment.

-
Find the percentage of overlap for each case in previous point.

The code that I have written works fine but given I have 74 files of type A and the rows in file B are about 200K, the approximate run time for this on my cluster is about ~18 days! I am writing the code and input files below and would really appreciate your comments on how I can improve the run time, with code if possible.

File A

Chromosome    Start       End
1         0           2420037
1         2420037     2522634
1         2522634     10794763
1         10794763    10925915
1         10925915    11280057
...

File B

chr       start   end     variantaccession    variantsubtype
1     10001   127330  nsv7879             Gain+Loss
1     10001   846808  dgv2n71             Gain
1     10377   177417  dgv1e1              Complex
1     10377   177417  nsv428112           Gain
1     10377   1018704 dgv3e1              Complex
1     10499   177368  esv27265            Gain+Loss
...

A lot of rows from file B will map to each row of file A. Once my code finds indices of mapped rows from file B it also extracts certain other values part of additional columns of file B.

setwd("/abc/xyz")

filenames =file$Start[j] & ref2$start[k]=file$Start[j] & ref2$end[k]

How I am calculating overlap:

File A, row 01: 1 2 3 4 5 6 7 8 9 10
File B, seg 01: 2 3 4
File B, seg 32: 5 6 7
File B, seg 12: 3 4 5 6 7 8 9
Consensus File B: 2 3 4 5 6 7 8 9
Percentage overlap = 8/10 -> 80%



Required Output (i.e add additional columns to File A)

Chromosome Start End Overlap_Count Overlap_Percentage Overlaps_Names Overlap_subtype
1 0 2420037 800 22.12% nsv7879,dgv1e1... Gain+Loss,Complex
1 2420037 2522634 626 35.12% nsv7879,dgv1e1... Gain+

Solution

I read the data in

a <- read.delim(textConnection("Chromosome\tStart\tEnd
1\t0\t2420037
1\t2420037\t2522634
1\t2522634\t10794763
1\t10794763\t10925915
1\t10925915\t11280057"))

b <- read.delim(textConnection("chr\tstart\tend\tvariantaccession\tvariantsubtype
1\t10001\t127330\tnsv7879\tGain+Loss
1\t10001\t846808\tdgv2n71\tGain
1\t10377\t177417\tdgv1e1\tComplex
1\t10377\t177417\tnsv428112\tGain
1\t10377\t1018704\tdgv3e1\tComplex
1\t10499\t177368\tesv27265\tGain+Loss"))

and then used the GenomicRanges package to efficiently find overlaps between the query "A" and subject "B".

library(GenomicRanges)
A <- makeGRangesFromDataFrame(a)
B <- makeGRangesFromDataFrame(b, keep.extra.columns=TRUE)
olaps <- findOverlaps(A, B)

olaps is basically a two-column matrix, indicating which element(s) of A overlap the corresponding elements of B. In the invocation above A and B enter symmetrically, but could be re-ordered if there were restrictions on the type of overlap (argument type, e.g., all of B strictly within A). To find the extent of overlap, I looked at the parallel (element wise) intersection of each overlap

> isect  isect
GRanges with 6 ranges and 0 metadata columns:
      seqnames           ranges strand
                   
  [1]        1 [10001,  127330]      *
  [2]        1 [10001,  846808]      *
  [3]        1 [10377,  177417]      *
  [4]        1 [10377,  177417]      *
  [5]        1 [10377, 1018704]      *
  [6]        1 [10499,  177368]      *
  ---
  seqlengths:
    1
   NA

I could then create a data.frame with whatever information I want, e.g.,

data.frame(query=queryHits(olaps), subject=subjectHits(olaps),
           olap_width=width(isect), 
           query_width=width(A)[queryHits(olaps)],
           variantaccession=B$variantaccession[subjectHits(olaps)])

This will be fast for millions of records in A and B.

makeGRangesFromDataFrame is available in the current release version of the package, installed by default when using R-3.1. It's easy to make a GRanges 'by hand', e.g.,

A <- with(a, GRanges(Chromosome, IRanges(Start, End)))
B <- with(b, GRanges(chr, IRanges(start, end), variantaccession=variantaccession))

Code Snippets

a <- read.delim(textConnection("Chromosome\tStart\tEnd
1\t0\t2420037
1\t2420037\t2522634
1\t2522634\t10794763
1\t10794763\t10925915
1\t10925915\t11280057"))

b <- read.delim(textConnection("chr\tstart\tend\tvariantaccession\tvariantsubtype
1\t10001\t127330\tnsv7879\tGain+Loss
1\t10001\t846808\tdgv2n71\tGain
1\t10377\t177417\tdgv1e1\tComplex
1\t10377\t177417\tnsv428112\tGain
1\t10377\t1018704\tdgv3e1\tComplex
1\t10499\t177368\tesv27265\tGain+Loss"))

library(GenomicRanges)
A <- makeGRangesFromDataFrame(a)
B <- makeGRangesFromDataFrame(b, keep.extra.columns=TRUE)
olaps <- findOverlaps(A, B)

> isect <- pintersect(A[queryHits(olaps)], B[subjectHits(olaps)])
> isect
GRanges with 6 ranges and 0 metadata columns:
      seqnames           ranges strand
         <Rle>        <IRanges>  <Rle>
  [1]        1 [10001,  127330]      *
  [2]        1 [10001,  846808]      *
  [3]        1 [10377,  177417]      *
  [4]        1 [10377,  177417]      *
  [5]        1 [10377, 1018704]      *
  [6]        1 [10499,  177368]      *
  ---
  seqlengths:
    1
   NA

data.frame(query=queryHits(olaps), subject=subjectHits(olaps),
           olap_width=width(isect), 
           query_width=width(A)[queryHits(olaps)],
           variantaccession=B$variantaccession[subjectHits(olaps)])

A <- with(a, GRanges(Chromosome, IRanges(Start, End)))
B <- with(b, GRanges(chr, IRanges(start, end), variantaccession=variantaccession))

Context

StackExchange Code Review Q#57785, answer score: 2

Revisions (0)

No revisions yet.