HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Detecting a subsequence that's an arithmetic progression, in a sorted sequence

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
arithmeticsubsequencesequencethatsorteddetectingprogression

Problem

I have following problem: I have a sorted sequence of $N$ integers (assume they are monotonically increasing). I want to check whether there is any subsequence of length $\ge N/4$, such that consecutive elements of the subsequence all differ by the same value.

For example, in the sequence [3,4,5,8,12] there are two such subsequences: [3,4,5] (the difference is 1) and [4,8,12] (the difference is 4). Thus, the length of longest such subsequence is 3 for this example. Since $3 \ge 5/4$, the answer is yes, there is a subsequence of length $\ge N/4$ with the desired property.

In my real-life situation, the sequence is of length $N\approx 10^6$, and the elements are all 9-digit numbers. Is there an efficient algorithm to solve this problem?

My naive approach was to create Cartesian product with absolute differences between numbers:

$$
\left( \begin{array}{ccccc}
0 & 1 & 2 & 5 & 9 \\
1 & 0 & 1 & 4 & 8 \\
2 & 1 & 0 & 3 & 7 \\
5 & 4 & 3 & 0 & 4 \\
9 & 8 & 7 & 4 & 0 \end{array} \right) $$

And then focus on top-right part and compute number of occurrences of each difference, so:

$$
||\text{diff-by-1}|| = 2 => \text{3 numbers diff by 1}\\
||\text{diff-by-4}|| = 2 => \text{3 numbers diff by 4}
$$

This is very simple and very ineffective. It requires lot of comparisons and does not scale (at all): its running time is $\Theta(N^2)$. In my real life scenario my sequence is ~10^6 long, so this is too slow.

To give you wider picture as maybe there is much better (probabilistic) approach to this problem: after largest sub-sequence is found I want to compute simple ratio:

$$
r:=\frac{\text{largest sub-sequence length}}{\text{sequence length}}
$$

and if $r$ is greater then some fixed value I want to raise alarm (or do whatever I have to do ;-)).

Thanks for any help, references, pointers, etc.

BTW: here are things that I was/am looking at:

  • http://link.springer.com/article/10.1007/s00453-009-9376-2



  • http://en.wikipedia.org/wiki/Longest_increasing_subsequence_proble

Solution

You recently clarified that you want to solve the following problem:


Given an input sequence $L[0..N-1]$, check whether an arithmetic progression of length $\ge N/4$ appears as a subsequence of $L$.

This is a significantly easier problem, and I can suggest a very efficient algorithm for you. Let me build up to it by first explaining a few key ideas underlying the algorithm.

Idea 1. Given any two elements of $L$, it is easy to extend them to the maximum-length arithmetic progression containing those two elements. In particular, suppose we have $x,y \in L$ with $x

-
For each $i_0 \in \{0.2N, 0.4N, 0.6N, 0.8N\}$, do:

-
For each $i$ such that $i\ge i_0$ and $L[i] \le L[i_0]+d_\max$, do:

-
For each $j$ such that $j>i$ and $L[j] \le L[i]+d_\max$, do:

  • Extend the arithmetic progression $L[i],L[j]$ on the left and right as far as possible (using Idea 1).



-
Look at the longest arithmetic progression found at any point above. If it has length $\ge N/4$, then yes, there exists an arithmetic progression of length $\ge N/4$. If its length is $

Hopefully you can see why this works. Idea 3 means that, if there is an arithmetic progression of length $\ge N/4$, then at least one of the indices $i$ that are visited in the above algorithm must be the index of one element of that progression. Moreover, the index of the next element of that progression must be one of the values taken on by $j$ in the innermost loop. Therefore, at some point the pair $L[i],L[j]$ will be a pair of consecutive elements of the arithmetic progression -- which is enough to discover that long arithmetic progression.

I would expect that this algorithm will typically be very efficient, on real-world data. If all of the integers in $L$ are in the range $[1,M]$, then we know $d_\max \le 4M/N$. Moreover, the loop over $i$ typically involves about $d_\max N/M \le 4$ iterations (on average), and similarly for the loop over $j$. So, we expect to perform maybe $5 \times 4 \times 4 = 80$ iterations of the innermost statement, regardless of $N$, if the data is randomly chosen (and in particular, not adversarily chosen). Consequently, this should be extremely efficient if the data doesn't conspire against you.

It is possible to further improve this algorithm, for instance, to improve its worst-case running time and reduce the opportunity for worst-case data to cause the running time to blow up, at the cost of making the code more complicated. However, I suspect that this simple algorithm might be sufficient for your needs.

Context

StackExchange Computer Science Q#18951, answer score: 4

Revisions (0)

No revisions yet.