patternpythonMinor
Calculating overlap of segments in chromosome data
Viewed 0 times
chromosomeoverlapcalculatingsegmentsdata
Problem
I wrote an R code that basically performs 2 operations:
-
For each segment in file A, find all segments in file B that lie in that segment.
-
Find the percentage of overlap for each case in previous point.
The code that I have written works fine but given I have 74 files of type A and the rows in file B are about 200K, the approximate run time for this on my cluster is about ~18 days! I am writing the code and input files below and would really appreciate your comments on how I can improve the run time, with code if possible.
File A
File B
A lot of rows from file B will map to each row of file A. Once my code finds indices of mapped rows from file B it also extracts certain other values part of additional columns of file B.
File B, seg 01: 2 3 4
File B, seg 32: 5 6 7
File B, seg 12: 3 4 5 6 7 8 9
Consensus File B: 2 3 4 5 6 7 8 9
Percentage overlap = 8/10 -> 80%
1 0 2420037 800 22.12% nsv7879,dgv1e1... Gain+Loss,Complex
1 2420037 2522634 626 35.12% nsv7879,dgv1e1... Gain+
-
For each segment in file A, find all segments in file B that lie in that segment.
-
Find the percentage of overlap for each case in previous point.
The code that I have written works fine but given I have 74 files of type A and the rows in file B are about 200K, the approximate run time for this on my cluster is about ~18 days! I am writing the code and input files below and would really appreciate your comments on how I can improve the run time, with code if possible.
File A
Chromosome Start End
1 0 2420037
1 2420037 2522634
1 2522634 10794763
1 10794763 10925915
1 10925915 11280057
...
File B
chr start end variantaccession variantsubtype
1 10001 127330 nsv7879 Gain+Loss
1 10001 846808 dgv2n71 Gain
1 10377 177417 dgv1e1 Complex
1 10377 177417 nsv428112 Gain
1 10377 1018704 dgv3e1 Complex
1 10499 177368 esv27265 Gain+Loss
...
A lot of rows from file B will map to each row of file A. Once my code finds indices of mapped rows from file B it also extracts certain other values part of additional columns of file B.
setwd("/abc/xyz")
filenames =file$Start[j] & ref2$start[k]=file$Start[j] & ref2$end[k]
How I am calculating overlap:
File A, row 01: 1 2 3 4 5 6 7 8 9 10File B, seg 01: 2 3 4
File B, seg 32: 5 6 7
File B, seg 12: 3 4 5 6 7 8 9
Consensus File B: 2 3 4 5 6 7 8 9
Percentage overlap = 8/10 -> 80%
Required Output (i.e add additional columns to File A)
Chromosome Start End Overlap_Count Overlap_Percentage Overlaps_Names Overlap_subtype1 0 2420037 800 22.12% nsv7879,dgv1e1... Gain+Loss,Complex
1 2420037 2522634 626 35.12% nsv7879,dgv1e1... Gain+
Solution
I read the data in
and then used the GenomicRanges package to efficiently find overlaps between the query "A" and subject "B".
I could then create a data.frame with whatever information I want, e.g.,
This will be fast for millions of records in A and B.
a <- read.delim(textConnection("Chromosome\tStart\tEnd
1\t0\t2420037
1\t2420037\t2522634
1\t2522634\t10794763
1\t10794763\t10925915
1\t10925915\t11280057"))
b <- read.delim(textConnection("chr\tstart\tend\tvariantaccession\tvariantsubtype
1\t10001\t127330\tnsv7879\tGain+Loss
1\t10001\t846808\tdgv2n71\tGain
1\t10377\t177417\tdgv1e1\tComplex
1\t10377\t177417\tnsv428112\tGain
1\t10377\t1018704\tdgv3e1\tComplex
1\t10499\t177368\tesv27265\tGain+Loss"))and then used the GenomicRanges package to efficiently find overlaps between the query "A" and subject "B".
library(GenomicRanges)
A <- makeGRangesFromDataFrame(a)
B <- makeGRangesFromDataFrame(b, keep.extra.columns=TRUE)
olaps <- findOverlaps(A, B)olaps is basically a two-column matrix, indicating which element(s) of A overlap the corresponding elements of B. In the invocation above A and B enter symmetrically, but could be re-ordered if there were restrictions on the type of overlap (argument type, e.g., all of B strictly within A). To find the extent of overlap, I looked at the parallel (element wise) intersection of each overlap> isect isect
GRanges with 6 ranges and 0 metadata columns:
seqnames ranges strand
[1] 1 [10001, 127330] *
[2] 1 [10001, 846808] *
[3] 1 [10377, 177417] *
[4] 1 [10377, 177417] *
[5] 1 [10377, 1018704] *
[6] 1 [10499, 177368] *
---
seqlengths:
1
NAI could then create a data.frame with whatever information I want, e.g.,
data.frame(query=queryHits(olaps), subject=subjectHits(olaps),
olap_width=width(isect),
query_width=width(A)[queryHits(olaps)],
variantaccession=B$variantaccession[subjectHits(olaps)])This will be fast for millions of records in A and B.
makeGRangesFromDataFrame is available in the current release version of the package, installed by default when using R-3.1. It's easy to make a GRanges 'by hand', e.g.,A <- with(a, GRanges(Chromosome, IRanges(Start, End)))
B <- with(b, GRanges(chr, IRanges(start, end), variantaccession=variantaccession))Code Snippets
a <- read.delim(textConnection("Chromosome\tStart\tEnd
1\t0\t2420037
1\t2420037\t2522634
1\t2522634\t10794763
1\t10794763\t10925915
1\t10925915\t11280057"))
b <- read.delim(textConnection("chr\tstart\tend\tvariantaccession\tvariantsubtype
1\t10001\t127330\tnsv7879\tGain+Loss
1\t10001\t846808\tdgv2n71\tGain
1\t10377\t177417\tdgv1e1\tComplex
1\t10377\t177417\tnsv428112\tGain
1\t10377\t1018704\tdgv3e1\tComplex
1\t10499\t177368\tesv27265\tGain+Loss"))library(GenomicRanges)
A <- makeGRangesFromDataFrame(a)
B <- makeGRangesFromDataFrame(b, keep.extra.columns=TRUE)
olaps <- findOverlaps(A, B)> isect <- pintersect(A[queryHits(olaps)], B[subjectHits(olaps)])
> isect
GRanges with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [10001, 127330] *
[2] 1 [10001, 846808] *
[3] 1 [10377, 177417] *
[4] 1 [10377, 177417] *
[5] 1 [10377, 1018704] *
[6] 1 [10499, 177368] *
---
seqlengths:
1
NAdata.frame(query=queryHits(olaps), subject=subjectHits(olaps),
olap_width=width(isect),
query_width=width(A)[queryHits(olaps)],
variantaccession=B$variantaccession[subjectHits(olaps)])A <- with(a, GRanges(Chromosome, IRanges(Start, End)))
B <- with(b, GRanges(chr, IRanges(start, end), variantaccession=variantaccession))Context
StackExchange Code Review Q#57785, answer score: 2
Revisions (0)
No revisions yet.