patternsqlModerate
What is Selectivity?
Viewed 0 times
selectivitywhatstackoverflow
Problem
The problem is I don't understand--no I DO understand selectivity--but I keep reading different definitions of it, which is confusing.
What I think (based on this by Gail Shaw): Selectivity is what predicates have. Columns aren't selective. Indexes aren't selective. Operators aren't selective. Selectivity is a measure of the percentage of rows that the predicate affects.
The confusion:
SQL Server Execution plans 3rd edition, Grant Fritchey. Page 223. He says
The selectivity of a predicate, for a given index, is the expected ratio of matching rows. Count the total number of rows in the table (z), count the number of distinct values (x) for a given column, or combination of columns, across all the rows, and then (x/z) gives the selectivity of the index, for an equality predicate comparing the column (or columns) against unknown values.
A highly selective index will have a low selectivity value. For example, a selectivity of
0.01 (1%) means that the optimizer expects 1% of the total rows in the table to match the predicate. Conversely, the worst possible selectivity is 1.0 (or 100%) meaning that every row will match the predicate condition.
Eh? I thought highly selective, i.e. 100%, i.e. 100% of the values are distinct, was a GOOD thing.
But he says 100% is the worst possible selectivity.
Then in this article, they calculate the selectivity of a column with 2 distinct values (gender) to be 0.02%. But 0.02% isn't good surely.
What I think (based on this by Gail Shaw): Selectivity is what predicates have. Columns aren't selective. Indexes aren't selective. Operators aren't selective. Selectivity is a measure of the percentage of rows that the predicate affects.
The confusion:
SQL Server Execution plans 3rd edition, Grant Fritchey. Page 223. He says
The selectivity of a predicate, for a given index, is the expected ratio of matching rows. Count the total number of rows in the table (z), count the number of distinct values (x) for a given column, or combination of columns, across all the rows, and then (x/z) gives the selectivity of the index, for an equality predicate comparing the column (or columns) against unknown values.
A highly selective index will have a low selectivity value. For example, a selectivity of
0.01 (1%) means that the optimizer expects 1% of the total rows in the table to match the predicate. Conversely, the worst possible selectivity is 1.0 (or 100%) meaning that every row will match the predicate condition.
Eh? I thought highly selective, i.e. 100%, i.e. 100% of the values are distinct, was a GOOD thing.
But he says 100% is the worst possible selectivity.
Then in this article, they calculate the selectivity of a column with 2 distinct values (gender) to be 0.02%. But 0.02% isn't good surely.
Solution
Selectivity is the property of a query predicate. One way to define selectivity is this:
Selectivity refers to the probability that any row will satisfy a predicate (that is, be true).
Thus
a selectivity of 0.01 (1%) for a predicate operating on a table with 1,000,000 rows means that the predicate returns an estimated 10,000 rows (1% of 1,000,000), and discards an estimated 990,000 rows.
High probability of a row to match the predicate means high value of selectivity.
Unfortunately, the choice of words is confusing, as you have found out:
A highly selective predicate (one with a selectivity of 0.10 or less) is desirable.
A "highly selective" ("very selective" might have been a better choice) predicate leads to a low selectivity value (i.e. low probability of a given row to match the predicate).
It appears that many people, including the authors of sources you quote, use the term "selectivity" to mean something entirely different. They often call it "index selectivity" and calculate it as the ratio of the index key cardinality (the number of distinct key values) to the underlying relation cardinality. Thus, the more distinct key values there are, the higher (closer to 1) the "selectivity", which is of course the inverse of the predicate selectivity for that combination of columns.
The confusion seems to be perpetrated by the SQL Server documentation, where it states (emphasis mine):
Density is information about the number of duplicates in a given column or combination of columns and it is calculated as 1/(number of distinct values). The Query Optimizer uses densities to enhance cardinality estimates for queries that return multiple columns from the same table or indexed view. As density decreases, selectivity of a value increases.
Taking into account that
Frequency is information about the occurrence of each distinct value in the first key column of the statistics object, and is calculated as row count * density. A maximum frequency of 1 can be found in columns with unique values.
we can see that value frequency = row count / (number of distinct values) and realize that the inverse of that Microsoft calls "value selectivity". Subsequently, the lower the number of distinct values, the higher the value frequency and the lower the value selectivity. This seems to be what your sources are talking about (though confusingly the first source mentions the predicate selectivity in the same sentence).
All that being said, when you are reading about "selectivity" you need to understand what kind of selectivity the particular author has in mind.
Selectivity refers to the probability that any row will satisfy a predicate (that is, be true).
Thus
a selectivity of 0.01 (1%) for a predicate operating on a table with 1,000,000 rows means that the predicate returns an estimated 10,000 rows (1% of 1,000,000), and discards an estimated 990,000 rows.
High probability of a row to match the predicate means high value of selectivity.
Unfortunately, the choice of words is confusing, as you have found out:
A highly selective predicate (one with a selectivity of 0.10 or less) is desirable.
A "highly selective" ("very selective" might have been a better choice) predicate leads to a low selectivity value (i.e. low probability of a given row to match the predicate).
It appears that many people, including the authors of sources you quote, use the term "selectivity" to mean something entirely different. They often call it "index selectivity" and calculate it as the ratio of the index key cardinality (the number of distinct key values) to the underlying relation cardinality. Thus, the more distinct key values there are, the higher (closer to 1) the "selectivity", which is of course the inverse of the predicate selectivity for that combination of columns.
The confusion seems to be perpetrated by the SQL Server documentation, where it states (emphasis mine):
Density is information about the number of duplicates in a given column or combination of columns and it is calculated as 1/(number of distinct values). The Query Optimizer uses densities to enhance cardinality estimates for queries that return multiple columns from the same table or indexed view. As density decreases, selectivity of a value increases.
Taking into account that
Frequency is information about the occurrence of each distinct value in the first key column of the statistics object, and is calculated as row count * density. A maximum frequency of 1 can be found in columns with unique values.
we can see that value frequency = row count / (number of distinct values) and realize that the inverse of that Microsoft calls "value selectivity". Subsequently, the lower the number of distinct values, the higher the value frequency and the lower the value selectivity. This seems to be what your sources are talking about (though confusingly the first source mentions the predicate selectivity in the same sentence).
All that being said, when you are reading about "selectivity" you need to understand what kind of selectivity the particular author has in mind.
Context
StackExchange Database Administrators Q#302287, answer score: 10
Revisions (0)
No revisions yet.