patternMinor
Matching two people. One has 7% in common with the other. The other has 70% in common. What's a fair match score?
Viewed 0 times
thefairwhatwithmatchscorehasonetwopeople
Problem
I am exploring some ideas around dating site matching, and I can do with some ideas as to which is the better solution
Senario: We have a boy and girl, the boy has completed his profile/bio thoroughly while the girl hasn't or maybe she just doesn't like a lot of things, so this is what the data looks like
They share 7 things in common. To the girl the boy is a 70% match which is great
But to the boy the girl is only a 7% match
Maybe the girl would have more in common if she had more in her list of likes so maybe giving their match a 7/100 is a match missed and maybe that's all she likes and 7/100 is the best they'll ever match but even then, it still might be a good match because maybe the boy over filled his list of things he likes while the girl chose carefully what she really likes.
I am looking for ideas for an algorithm that would give the best match score to order them in lists with other people/matches who might have small percentage in matches because their lists of things they like are 1000s of entries long for example
Edit: I will not be matching strings, each of strings in the table above will have a unique ID in an RDBMS database. A table will be used to link users and topics
Senario: We have a boy and girl, the boy has completed his profile/bio thoroughly while the girl hasn't or maybe she just doesn't like a lot of things, so this is what the data looks like
Boy likes: Girl likes:
--------- -----------
Painting Painting
Movies Movies
Long walks Long walks
Harry Potter Harry Potter
Dogs Dogs
Pizza Pizza
Computer Science Computer Science
Mercedes Ford
Traveling Daisies
Diving Psycology
...
Total things
---------------------------------------
100 10They share 7 things in common. To the girl the boy is a 70% match which is great
But to the boy the girl is only a 7% match
Maybe the girl would have more in common if she had more in her list of likes so maybe giving their match a 7/100 is a match missed and maybe that's all she likes and 7/100 is the best they'll ever match but even then, it still might be a good match because maybe the boy over filled his list of things he likes while the girl chose carefully what she really likes.
I am looking for ideas for an algorithm that would give the best match score to order them in lists with other people/matches who might have small percentage in matches because their lists of things they like are 1000s of entries long for example
Edit: I will not be matching strings, each of strings in the table above will have a unique ID in an RDBMS database. A table will be used to link users and topics
Solution
It seems you're looking for a symmetric set similarity measure. (Symmetric since, as you point out, $A$ should match $B$ as much as $B$ matches $A$. Set similarity since each person's preference is defined by a set of objects.)
A number of these are used in the CS literature. Probably the most common is Jaccard similarity, defined by $|A\cap B|/|A\cup B|$. In your example, $A$ and $B$ have 7 elements in common, and there are 103 elements between them, leading to a similarity of $7/103$. A similar metric is Braun-Blanquet similarity, $|A\cap B|/\max\{|A|,|B|\}$; in your example this gives $7/100$. With these as starting points you may be able to find further metrics that capture what you want better. For example, here's a list I found while writing this.
What's notable about these metrics is that they don't take each person's dislikes into account---the similarity is defined only by what's on their list, not what's not on it. (So you may think it makes a better match if two people don't like fish.) Incorporating this would require a universe of elements (so any element in the universe not on the list is considered a dislike). In this case you can use the classic Hamming Distance. You have other options as well, like Euclidean distance or even the Sørensen-Dice similarity.
One benefit of using one of the above, well-known similarity measures is that you can use known algorithms to compute nearest neighbors quickly.
A number of these are used in the CS literature. Probably the most common is Jaccard similarity, defined by $|A\cap B|/|A\cup B|$. In your example, $A$ and $B$ have 7 elements in common, and there are 103 elements between them, leading to a similarity of $7/103$. A similar metric is Braun-Blanquet similarity, $|A\cap B|/\max\{|A|,|B|\}$; in your example this gives $7/100$. With these as starting points you may be able to find further metrics that capture what you want better. For example, here's a list I found while writing this.
What's notable about these metrics is that they don't take each person's dislikes into account---the similarity is defined only by what's on their list, not what's not on it. (So you may think it makes a better match if two people don't like fish.) Incorporating this would require a universe of elements (so any element in the universe not on the list is considered a dislike). In this case you can use the classic Hamming Distance. You have other options as well, like Euclidean distance or even the Sørensen-Dice similarity.
One benefit of using one of the above, well-known similarity measures is that you can use known algorithms to compute nearest neighbors quickly.
Context
StackExchange Computer Science Q#73709, answer score: 4
Revisions (0)
No revisions yet.