What would be the big O notation for a function that attempts to find pairs of users that participate in different discussions?
Problem
Recently, I started thinking about a problem involving Twitter data: finding the pairs of users that participate in the most conversations together. With each tweet, you get the root conversation id along with the user id for that specific tweet. What I want to do is find pairs of users that participate in the same threads, and also find the pair of users that has participated in the most threads. (Two users participate in the same thread when, in two tweets, their user ids are associated with the same conversation id.)
This would involve first scanning every tweet (N of them) to collect (conversation id, user id) pairs. After doing this, I would have a list of user ids for each conversation id. I would then scan every user id associated with a conversation id, proceed to the next conversation id, and compare each of its user ids against the lists from the previous conversation ids.
At the very least, I feel like the best method would take n log n time, but the method I'm proposing seems to make more comparisons than n log n, though fewer than n^2.
How would I go about computing big O in this scenario?
Edit: I don't know if it is good form to edit one's question with a possible solution, but it appears you would need to scan each conversation (K of them) to get the list of user-pair tuples for that conversation, of which there are (N^2 - N)/2. You can then hash each tuple and increment the value of that dict key by one each time you discover the same tuple in another conversation. So ultimately you have to do about (N^2 - N)/2 comparisons for each of the K conversations. If I remember correctly, you drop the constants for big O, so the running time would be somewhere around K * (N^2 - N)/2: larger than n log n but not quite N^2. I'm not sure how to correctly form the big O here (what do you do with K?).
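The dict-of-pair-counts idea from the edit can be sketched as follows. This is a minimal illustration, not a tuned implementation; the function name and the input format (an iterable of `(conversation_id, user_id)` tuples) are my own assumptions.

```python
from collections import Counter
from itertools import combinations

def top_pair(tweets):
    """tweets: iterable of (conversation_id, user_id) tuples.

    Returns the (user_pair, shared_thread_count) that co-occurs in
    the most conversations, or None if no pair exists.
    """
    # Group the distinct participants of each conversation.
    convo_users = {}
    for convo, user in tweets:
        convo_users.setdefault(convo, set()).add(user)

    # For each conversation, count every unordered user pair once.
    # Sorting makes (a, b) and (b, a) hash to the same key.
    pair_counts = Counter()
    for users in convo_users.values():
        for pair in combinations(sorted(users), 2):
            pair_counts[pair] += 1

    return pair_counts.most_common(1)[0] if pair_counts else None
```

Note that the per-conversation cost is proportional to the number of pairs in that conversation, i.e. (d^2 - d)/2 for a conversation with d distinct participants, which matches the counting argument in the edit.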
Solution
It might be nice to model the problem as a bipartite graph $(U, C, E)$ where $U$ is the set of users, $C$ is the set of conversations, and $E \subseteq U \times C$ are edges for participation.
In this setting, for each user, the set of other users that participate in the same conversations are its neighbors-of-neighbors. Let's make a new graph where each user is connected to its neighbors-of-neighbors. In the new graph, give each edge a weight which is the number of unique conversations in which those two users participate. This can all be computed in O(|E|) because each edge in the new graph corresponds to two edges of the original.
In the new graph, look at the edge with the largest weight. The endpoints are the pair of users that converse the most with each other.
(I think this is approximately the algorithm that you had in mind. Using a bipartite graph makes things clearer.)
Altogether, that gives you O(|E|) for computing the result. I don't think any other algorithm can do better than $O(|E|)$ in the worst case.
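The neighbors-of-neighbors construction above can be sketched like this, assuming the bipartite graph is given as a list of (user, conversation) edges; the names here are illustrative, not from the answer.

```python
from collections import defaultdict, Counter

def heaviest_pair(edges):
    """edges: iterable of (user, conversation) edges of the
    bipartite graph (U, C, E).

    Returns the user pair with the largest edge weight in the
    derived neighbors-of-neighbors graph, or None.
    """
    # Adjacency from the C side: conversation -> set of users.
    members = defaultdict(set)
    for user, convo in set(edges):  # dedupe repeated participation
        members[convo].add(user)

    # Two edges (u, c) and (v, c) sharing a conversation c yield one
    # neighbors-of-neighbors edge (u, v); its weight is the number
    # of distinct conversations both users appear in.
    weight = Counter()
    for users in members.values():
        ordered = sorted(users)
        for i, u in enumerate(ordered):
            for v in ordered[i + 1:]:
                weight[(u, v)] += 1

    return max(weight.items(), key=lambda kv: kv[1]) if weight else None
```

Note that the inner double loop is what the EDIT below warns about: a single large conversation already generates quadratically many derived edges.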
EDIT: this is too slow, because the new graph can be huge, much larger than |E|: a single conversation with $d$ participants contributes $\binom{d}{2}$ edges to the new graph, so the total work is $O(\sum_c d_c^2)$ rather than $O(|E|)$. See comments.
Context
StackExchange Computer Science Q#141403, answer score: 2