patternMinor
Uniformly Random Nested Subset Pairs
Viewed 0 times
randomuniformlynestedsubsetpairs
Problem
Main Question
We can represent subsets of a vector using, say, a bit mask. Let's say a nested subset is a pair of masks, for example
is a nested subset, where the first mask tells us inclusion in the first set and the second mask tells us inclusion in the first. In this case the first set is
Some Thoughts
One could try naively to generate two random masks, use the first for the union, then the second for the intersection where it overlaps. However, this would generate subsets uniform in the initial subset, and not the pair.
Mathematically speaking this can be done by weighting each mask by the number of possible subsets it has (
My first idea was to find some (efficient) hash function from integers into subset pairs, however I have not had much success with this. I am wondering if there is a superior way to do this. I don't need code (I don't have implementation issues); I am looking a conceptual approach. Any advice?
We can represent subsets of a vector using, say, a bit mask. Let's say a nested subset is a pair of masks, for example
A B C D E F
1 0 0 1 0 1
1 0 0 0 0 1is a nested subset, where the first mask tells us inclusion in the first set and the second mask tells us inclusion in the first. In this case the first set is
{A,D,F} and its subset is {A,F}. I am trying to uniformly generate such subset pairs.Some Thoughts
One could try naively to generate two random masks, use the first for the union, then the second for the intersection where it overlaps. However, this would generate subsets uniform in the initial subset, and not the pair.
Mathematically speaking this can be done by weighting each mask by the number of possible subsets it has (
2^n where n is the size of the subset), randomly drawing a subset, then randomly drawing an intersection from it. However this is very challenging/inefficient to implement in practiceMy first idea was to find some (efficient) hash function from integers into subset pairs, however I have not had much success with this. I am wondering if there is a superior way to do this. I don't need code (I don't have implementation issues); I am looking a conceptual approach. Any advice?
Solution
For the sake of simplicity, I will assume that: 1) empty sets are allowed, 2) we count a set as a subset of itself (i.e.: both the big set and small set can represent the same set). If these assumptions don't hold, the math gets a little messier but the basic principles still hold.
First, we observe that there are $2^i{n\choose i}$ pairs of subsets where the bigger subset has cardinality $i$ and that there are a total of $\sum_i 2^i{n\choose i} = 3^n$ such pairs.
We proceed as follows:
1) We choose $k$ (between $0$ and $n$), the size of the bigger subset. This will not be a uniformly random choice but our initial observation lets us know the correct probabilities, i.e.: $P(k = i) = 2^i{n\choose i} / 3^n$.
2) We choose uniformly among the ${n \choose k}$ subsets of cardinality $k$ to be our bigger subset. Reservoir sampling should do the trick.
3) We choose uniformly among the $2^k$ subsets of the bigger subset to be the smaller subset. Simply choosing a random number from $0$ to $2^k-1$ and using the natural correspondence between sets and binary representations of integers should suffice.
EDITED TO ADD:
I just realized an even slicker way to do this but it depends heavily on the first two assumptions I stated. Unlike my originally provided method, I don't see an obvious way to adapt it if either of those assumptions fails to hold.
Recall that we deduced that the total number of eligible subset pairs was $3^n$. We can identify a bijection between each subset pair and $\{0,1,2\}^n$, i.e.: vectors of length $n$ with entries in $\{0,1,2\}$. Let $v$ be such a vector. If the $i^{th}$ entry of $v$ is $2$, then put $i$ in both sets; if it's $1$, then put it only in the bigger set; if it's $0$, ignore it.
So, the problem reduces to generating a number between $0$ and $3^n-1$ and examining its ternary representation.
First, we observe that there are $2^i{n\choose i}$ pairs of subsets where the bigger subset has cardinality $i$ and that there are a total of $\sum_i 2^i{n\choose i} = 3^n$ such pairs.
We proceed as follows:
1) We choose $k$ (between $0$ and $n$), the size of the bigger subset. This will not be a uniformly random choice but our initial observation lets us know the correct probabilities, i.e.: $P(k = i) = 2^i{n\choose i} / 3^n$.
2) We choose uniformly among the ${n \choose k}$ subsets of cardinality $k$ to be our bigger subset. Reservoir sampling should do the trick.
3) We choose uniformly among the $2^k$ subsets of the bigger subset to be the smaller subset. Simply choosing a random number from $0$ to $2^k-1$ and using the natural correspondence between sets and binary representations of integers should suffice.
EDITED TO ADD:
I just realized an even slicker way to do this but it depends heavily on the first two assumptions I stated. Unlike my originally provided method, I don't see an obvious way to adapt it if either of those assumptions fails to hold.
Recall that we deduced that the total number of eligible subset pairs was $3^n$. We can identify a bijection between each subset pair and $\{0,1,2\}^n$, i.e.: vectors of length $n$ with entries in $\{0,1,2\}$. Let $v$ be such a vector. If the $i^{th}$ entry of $v$ is $2$, then put $i$ in both sets; if it's $1$, then put it only in the bigger set; if it's $0$, ignore it.
So, the problem reduces to generating a number between $0$ and $3^n-1$ and examining its ternary representation.
Context
StackExchange Computer Science Q#77272, answer score: 6
Revisions (0)
No revisions yet.